8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) #14227

vamsi-parasa · 2023-05-30T18:54:50Z

The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.

This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the performance data below.

Arrays.sort performance data using JMH benchmarks for arrays with random data

Arrays.sort benchmark	Array Size	Baseline (us/op)	AVX512 Sort (us/op)	Speedup
ArraysSort.doubleSort	10	0.034	0.035	1.0
ArraysSort.doubleSort	25	0.116	0.089	1.3
ArraysSort.doubleSort	50	0.282	0.291	1.0
ArraysSort.doubleSort	75	0.474	0.358	1.3
ArraysSort.doubleSort	100	0.654	0.623	1.0
ArraysSort.doubleSort	1000	9.274	6.331	1.5
ArraysSort.doubleSort	10000	323.339	71.228	4.5
ArraysSort.doubleSort	100000	4471.871	1002.748	4.5
ArraysSort.doubleSort	1000000	51660.742	12921.295	4.0
ArraysSort.floatSort	10	0.045	0.046	1.0
ArraysSort.floatSort	25	0.103	0.084	1.2
ArraysSort.floatSort	50	0.285	0.33	0.9
ArraysSort.floatSort	75	0.492	0.346	1.4
ArraysSort.floatSort	100	0.597	0.326	1.8
ArraysSort.floatSort	1000	9.811	5.294	1.9
ArraysSort.floatSort	10000	323.955	50.547	6.4
ArraysSort.floatSort	100000	4326.38	731.152	5.9
ArraysSort.floatSort	1000000	52413.88	8409.193	6.2
ArraysSort.intSort	10	0.033	0.033	1.0
ArraysSort.intSort	25	0.086	0.051	1.7
ArraysSort.intSort	50	0.236	0.151	1.6
ArraysSort.intSort	75	0.416	0.332	1.3
ArraysSort.intSort	100	0.63	0.521	1.2
ArraysSort.intSort	1000	10.518	4.698	2.2
ArraysSort.intSort	10000	309.659	42.518	7.3
ArraysSort.intSort	100000	4130.917	573.956	7.2
ArraysSort.intSort	1000000	49876.307	6712.812	7.4
ArraysSort.longSort	10	0.036	0.037	1.0
ArraysSort.longSort	25	0.094	0.08	1.2
ArraysSort.longSort	50	0.218	0.227	1.0
ArraysSort.longSort	75	0.466	0.402	1.2
ArraysSort.longSort	100	0.76	0.58	1.3
ArraysSort.longSort	1000	10.449	6.239	1.7
ArraysSort.longSort	10000	307.074	70.284	4.4
ArraysSort.longSort	100000	4135.651	1049.12	3.9
ArraysSort.longSort	1000000	49559.152	12537.53	4.0

This is the first in the series of PRs related to vectorized sorting. Future planned steps after the integration of this PR:

Enabling AVX2 sort and partitioning for Linux ( based on Intel's x86-simd-sort PR).
Enabling SIMD sort (both AVX512 and AVX2) for Short and Char arrays for length < 1750. (Arrays larger than 1750 already use countingSort which is faster than AVX512 sort.)
Adding Windows support for both AVX512 and AVX2 using assembly stubs.
Enabling SIMD sort (both AVX512 and AVX2) for MemorySegment.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) (Enhancement - P4)

Reviewers

Jatin Bhateja (@jatin-bhateja - Reviewer) ⚠️ Review applies to e63a2aa0
Sandhya Viswanathan (@sviswa7 - Reviewer) ⚠️ Review applies to 9642d852
Paul Sandoz (@PaulSandoz - Reviewer) ⚠️ Review applies to b04cb6c3
Vladimir Kozlov (@vnkozlov - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14227/head:pull/14227
$ git checkout pull/14227

Update a local copy of the PR:
$ git checkout pull/14227
$ git pull https://git.openjdk.org/jdk.git pull/14227/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14227

View PR using the GUI difftool:
$ git pr show -t 14227

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14227.diff

Webrev

Link to Webrev Comment

… float and double arrays)

bridgekeeper · 2023-05-30T18:56:14Z

👋 Welcome back vamsi-parasa! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

vamsi-parasa · 2023-05-30T18:56:59Z

/label hotspot-compiler

openjdk · 2023-05-30T18:57:50Z

@vamsi-parasa
The hotspot-compiler label was successfully added.

mlbridge · 2023-05-30T19:02:40Z

Webrevs

…sort

theRealAph · 2023-05-31T15:25:35Z

What happens to really short arrays? Your patch should include macro benchmarks for e.g. 50 and 10.

vamsi-parasa · 2023-06-01T01:16:22Z

What happens to really short arrays? Your patch should include macro benchmarks for e.g. 50 and 10.

Thanks for the suggestion. Please see the performance for small array sizes below:

Arrays.sort benchmark	Array Size	Baseline (us/op)	AVX512 Sort (us/op)	Speedup
ArraysSort.intSort	10	0.029	0.018	1.6
ArraysSort.intSort	25	0.086	0.032	2.7
ArraysSort.intSort	50	0.236	0.056	4.2
ArraysSort.intSort	75	0.409	0.111	3.7
ArraysSort.longSort	10	0.031	0.033	0.9
ArraysSort.longSort	25	0.09	0.061	1.5
ArraysSort.longSort	50	0.228	0.127	1.8
ArraysSort.longSort	75	0.382	0.28	1.4
ArraysSort.doubleSort	10	0.037	0.043	0.9
ArraysSort.doubleSort	25	0.129	0.066	2.0
ArraysSort.doubleSort	50	0.267	0.115	2.3
ArraysSort.doubleSort	75	0.549	0.219	2.5
ArraysSort.floatSort	10	0.034	0.034	1.0
ArraysSort.floatSort	25	0.088	0.053	1.7
ArraysSort.floatSort	50	0.284	0.077	3.7
ArraysSort.floatSort	75	0.484	0.126	3.8

merykitty · 2023-06-01T02:46:20Z

I notice that

zmm_t ymm_vector<float>::max(zmm_t x, zmm_t y) {
    return _mm256_max_ps(x, y);
}

This is not quite right, Arrays.sort uses the total order imposed by Double.compare to sort the array, while _mm256_max_ps(x, y) does x > y ? x : y which is different.

test/micro/org/openjdk/bench/java/util/ArraysSort.java

franz1981 · 2023-06-01T13:58:48Z

Hi @vamsi-parasa !
Given https://bugs.openjdk.org/browse/JDK-8295496 I have noticed how much important is to add benchmark cases where offset and length parameters change and/or differ from the usual 0 and the whole array length.
Equally important is to warmup with different combinations of them in order to "pollute" the JIT existing decisions, making the compiled method (and stubs) to appear more similar to what users would observe in a real world scenario. Playing with the benchmark parameters like this, together with the advice of @theRealAph to try with small inputs (that matters a lot) would unveil any perf difference with the current impl.
In addition, I understand by https://github.com/openjdk/jdk/pull/14227/files#diff-1929ace9ae6df116e2fa2a718ed3924d9dae9a2daea454ca9a78177c21477aa3R5237 that's still not the case for such, at this implementation stage, hence mine is a wish for the final round impl for this PR. 🙏

Co-authored-by: Andrew Haley <[email protected]>

vamsi-parasa · 2023-06-01T17:55:19Z

I notice that
zmm_t ymm_vector<float>::max(zmm_t x, zmm_t y) {
    return _mm256_max_ps(x, y);
}
This is not quite right, Arrays.sort uses the total order imposed by Double.compare to sort the array, while _mm256_max_ps(x, y) does x > y ? x : y which is different.

Hi @merykitty
The algorithm is working for double as expected (i.e. implementing the total order). For example, for the input below:
double[] arrayUnsorted = {-0.0, Double.NaN, 15.75, Double.POSITIVE_INFINITY, -234.4869, Double.NEGATIVE_INFINITY, +0.0, 100.045};
It's showing the correct output after sorting as expected:
[-Infinity, -234.4869, -0.0, 0.0, 15.75, 100.045, Infinity, NaN]

vamsi-parasa · 2023-06-02T04:14:50Z

Hi @vamsi-parasa ! Given https://bugs.openjdk.org/browse/JDK-8295496 I have noticed how much important is to add benchmark cases where offset and length parameters change and/or differ from the usual 0 and the whole array length. Equally important is to warmup with different combinations of them in order to "pollute" the JIT existing decisions, making the compiled method (and stubs) to appear more similar to what users would observe in a real world scenario. Playing with the benchmark parameters like this, together with the advice of @theRealAph to try with small inputs (that matters a lot) would unveil any perf difference with the current impl. In addition, I understand by https://github.com/openjdk/jdk/pull/14227/files#diff-1929ace9ae6df116e2fa2a718ed3924d9dae9a2daea454ca9a78177c21477aa3R5237 that's still not the case for such, at this implementation stage, hence mine is a wish for the final round impl for this PR. 🙏

Hi @franz1981, thank you for the suggestions! The algorithm was tested to sort only a part of the array with non-zero offsets and length. I will upstream those benchmarks/tests as well.

test/micro/org/openjdk/bench/java/util/ArraysSort.java

src/java.base/share/classes/java/util/Arrays.java

openjdk · 2023-06-23T21:08:42Z

@vamsi-parasa this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout avx512sort
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk · 2023-10-06T20:15:52Z

@sviswa7 @vamsi-parasa Pushed as commit a4e9168.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

iaroslavski · 2023-10-06T21:04:54Z

Hi @vamsi-parasa,

May be too late but there is one question. We have 2 new methods
private static <A> void sort(Class<?> elemType, A array, ...
private static <A> int[] partition(Class<?> elemType, A array, ...
and 4 methods which are passed to them:
mixedInsertionSort, insertionSort, partitionDualPivot, partitionSinglePivot

Methods partitionDualPivot and partitionDualPivotare @ForceInline,
but mixedInsertionSort' and insertionSort` are not. Is it ok or did we miss something?

Voultapher · 2023-10-07T11:44:00Z

When the numpy news came out I analyzed the intel implementation here https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md. I think it's relevant here as well. I assume the new version of x86-simd-sort contains improvements, I hope they have addressed the easy to hit quadratic runtime.

iaroslavski · 2023-10-07T15:32:08Z

Hi @vamsi-parasa,

If DualPivotQuicksort.java is updated, can you improve the class a little bit?

Please find my suggested changes:

< removed >
Javadoc of int[] partition(A a, int low, ) says 'to be sorted' for paramters a, low and high.
It should be to be partitioned. Please correct.
The same for method private static <A> int[] partition(Class<?> elemType,, update javadoc.
< removed >
And changes from my previous comment: all 4 methods should have @ForceInline and all methods should be without
@ForceInline. Am I right?

I guess new PR should be opened for all mentioned changes. Could you please go ahead?
I appreciate if you help to make sorting more accurate and better.

mlbridge · 2023-10-07T17:54:49Z

Mailing list message from Florian Weimer on hotspot-runtime-dev:

I believe this has introduced a build failure with GCC 12.2 on Debian 12.1:

Building target 'jdk' in configuration '/home/fw/build/jdk'
In file included from /usr/lib/gcc/x86_64-linux-gnu/12/include/immintrin.h:49,
from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-common-qsort.h:60,
from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:31,
from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-linux-qsort.cpp:26:
In function '__m512i _mm512_shuffle_epi32(__m512i, _MM_PERM_ENUM)',
inlined from 'static zmm_vector<int>::zmm_t zmm_vector<int>::shuffle(zmm_t) [with unsigned char mask = 177]' at ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:96:36,
inlined from 'zmm_t sort_zmm_32bit(zmm_t) [with vtype = zmm_vector<int>; zmm_t = __vector(8) long long int]' at ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:170:27:
/usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h:4459:50: error: '__Y' is used uninitialized [-Werror=uninitialized]
4459 | return (__m512i) __builtin_ia32_pshufd512_mask ((__v16si) __A,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
4460 | __mask,
| ~~~~~~~
4461 | (__v16si)
| ~~~~~~~~~
4462 | _mm512_undefined_epi32 (),
| ~~~~~~~~~~~~~~~~~~~~~~~~~~
4463 | (__mmask16) -1);
| ~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h: In function 'zmm_t sort_zmm_32bit(zmm_t) [with vtype = zmm_vector<int>; zmm_t = __vector(8) long long int]':
/usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h:206:11: note: '__Y' was declared here
206 | __m512i __Y = __Y;
| ^~~

iaroslavski · 2023-10-07T21:48:17Z

@vamsi-parasa
Please disrard my request to change high -> end. I find out a way to update Java code only.
I updated my previous comment. Only items 2, 3, 5 should be performed.

DanielThomas · 2023-10-08T06:18:13Z

A discussion on Reddit raised that this had the potential to regress sort performance on AMD Zen 4. The poster didn't have access to Zen 4 hardware, so I took a moment to run the benchmark, comparing against a baseline with UseAVX=2 on an AWS m7a.2xlarge:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 17
model name	: AMD EPYC 9R14
stepping	: 1
microcode	: 0xa10113e
cpu MHz		: 3673.537
cache size	: 1024 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr rdpru wbnoinvd arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid flush_l1d
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 5200.00
TLB size	: 3584 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

$ make test TEST="micro:java.lang.ArraysSort.intSort"

Benchmark            (size)  Mode  Cnt      Score     Error  Units
ArraysSort.intSort       10  avgt    3      0.033 ?   0.001  us/op
ArraysSort.intSort       25  avgt    3      0.067 ?   0.002  us/op
ArraysSort.intSort       50  avgt    3      0.391 ?   0.007  us/op
ArraysSort.intSort       75  avgt    3      0.923 ?   0.041  us/op
ArraysSort.intSort      100  avgt    3      1.292 ?   0.009  us/op
ArraysSort.intSort     1000  avgt    3     24.268 ?   0.078  us/op
ArraysSort.intSort    10000  avgt    3    320.131 ?   0.381  us/op
ArraysSort.intSort   100000  avgt    3   4803.142 ?  63.901  us/op
ArraysSort.intSort  1000000  avgt    3  61457.708 ? 347.446  us/op

$ make test TEST="micro:java.lang.ArraysSort.intSort" MICRO="VM_OPTIONS=-XX:UseAVX=2"

Benchmark            (size)  Mode  Cnt      Score      Error  Units
ArraysSort.intSort       10  avgt    3      0.034 ?    0.001  us/op
ArraysSort.intSort       25  avgt    3      0.091 ?    0.008  us/op
ArraysSort.intSort       50  avgt    3      0.247 ?    0.022  us/op
ArraysSort.intSort       75  avgt    3      0.432 ?    0.005  us/op
ArraysSort.intSort      100  avgt    3      0.560 ?    0.018  us/op
ArraysSort.intSort     1000  avgt    3      8.848 ?    0.530  us/op
ArraysSort.intSort    10000  avgt    3    118.476 ?  478.914  us/op
ArraysSort.intSort   100000  avgt    3   4656.937 ?  110.323  us/op
ArraysSort.intSort  1000000  avgt    3  56598.595 ? 2749.859  us/op

This is explained in the linked comments/commits and this Twitter thread:

PaulSandoz · 2023-10-09T17:44:33Z

@DanielThomas thank you, i was not aware of the limitation on AMD Zen 4. We need to address this and it appears the fix is simple and localized. Thankfully we are early into the JDK 22 development cycle so we have time to shake this out.

PaulSandoz · 2023-10-09T18:15:12Z

Mailing list message from Florian Weimer on hotspot-runtime-dev:

I believe this has introduced a build failure with GCC 12.2 on Debian 12.1:

I believe it due to an issue in GCC 12 that has yet to be resolved:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593

We (Oracle) currently build with GCC 11.2.0 (https://wiki.openjdk.org/display/Build/Supported+Build+Platforms) so i expect we will encounter the same problem when upgrading to the 12 toolchain.

vamsi-parasa · 2023-10-09T18:52:28Z

Mailing list message from Florian Weimer on hotspot-runtime-dev:

I believe this has introduced a build failure with GCC 12.2 on Debian 12.1:

Building target 'jdk' in configuration '/home/fw/build/jdk' In file included from /usr/lib/gcc/x86_64-linux-gnu/12/include/immintrin.h:49, from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-common-qsort.h:60, from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:31, from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-linux-qsort.cpp:26: In function '__m512i _mm512_shuffle_epi32(__m512i, _MM_PERM_ENUM)', inlined from 'static zmm_vector::zmm_t zmm_vector::shuffle(zmm_t) [with unsigned char mask = 177]' at ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:96:36, inlined from 'zmm_t sort_zmm_32bit(zmm_t) [with vtype = zmm_vector; zmm_t = __vector(8) long long int]' at ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:170:27: /usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h:4459:50: error: '__Y' is used uninitialized [-Werror=uninitialized] 4459 | return (__m512i) __builtin_ia32_pshufd512_mask ((__v16si) __A, | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~ 4460 | __mask, | ~~~~~~~ 4461 | (__v16si) | ~~~~~~~~~ 4462 | _mm512_undefined_epi32 (), | ~~~~~~~~~~~~~~~~~~~~~~~~~~ 4463 | (__mmask16) -1); | ~~~~~~~~~~~~~~~ /usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h: In function 'zmm_t sort_zmm_32bit(zmm_t) [with vtype = zmm_vector; zmm_t = __vector(8) long long int]': /usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h:206:11: note: '__Y' was declared here 206 | __m512i __Y = __Y; | ^~~

Hello Florian (@fweimer),

As Paul mentioned, this is an issue in GCC 12.
Will reproduce it on my side and issue a workaround PR shortly to address this issue.

Thanks,
Vamsi

vamsi-parasa · 2023-10-09T18:56:43Z

A discussion on Reddit raised that this had the potential to regress sort performance on AMD Zen 4. The poster didn't have access to Zen 4 hardware, so I took a moment to run the benchmark, comparing against a baseline with UseAVX=2 on an AWS m7a.2xlarge:

Hello Daniel (@DanielThomas ),

Thank you for sharing the data. Will issue a new PR to restrict the AVX512 sort to Intel CPUs only for the time being.

Thanks,
Vamsi

vamsi-parasa · 2023-10-09T20:55:54Z

Mailing list message from Florian Weimer on hotspot-runtime-dev:
I believe this has introduced a build failure with GCC 12.2 on Debian 12.1:

I believe it due to an issue in GCC 12 that has yet to be resolved:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593

We (Oracle) currently build with GCC 11.2.0 (https://wiki.openjdk.org/display/Build/Supported+Build+Platforms) so i expect we will encounter the same problem when upgrading to the 12 toolchain.

Hello Paul (@PaulSandoz) and Florian (@fweimer),

Just did a successful OpenJDK build (without any of the errors reported) using the latest GCC 12 (12.3.1) . Also, reproduced the reported build failure using GCC 12.2.0.
Looks like this issue is fixed in gcc 12.3.1.

Thanks,
Vamsi

iwanowww · 2023-10-11T23:25:22Z

src/java.base/share/classes/java/util/DualPivotQuicksort.java

+    @IntrinsicCandidate
+    @ForceInline
+    private static <A> void sort(Class<?> elemType, A array, long offset, int low, int high, SortOperation<A> so) {
+        so.sort(array, low, high);


I'm late to the party, but how does the fallback implementation work (or is intended to) for off-heap case (null + absolute base address)?

Also, for on-heap case the fallback implementation is equivalent to intrinsified case only when offset points at the 0th element of the array.

@iwanowww Yes, you are late to the party :). The fallback implementation could be similar to the vectorizedMismatch regarding base/offset for non heap case. Please see java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java.
Yes, the fallback implementation for non intrinsic case is kept to be equivalent to what was before the SIMD sort PR. This is done to not affect the fallback performance on other platforms.

The problem with fallback implementation as it is now is that it doesn't take into account the offset. It's completely broken for off-heap case and works for on-heap case only because the implementation always passes primitive array base offset.

himichael · 2023-10-13T07:02:18Z

The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.

This PR shows upto ~7x improvement for 32-bit datatypes (int, float) and upto ~4.5x improvement for 64-bit datatypes (long, double) as shown in the performance data below.

Arrays.sort performance data using JMH benchmarks for arrays with random data

Arrays.sort benchmark Array Size Baseline (us/op) AVX512 Sort (us/op) Speedup
ArraysSort.doubleSort 10 0.034 0.035 1.0
ArraysSort.doubleSort 25 0.116 0.089 1.3
ArraysSort.doubleSort 50 0.282 0.291 1.0
ArraysSort.doubleSort 75 0.474 0.358 1.3
ArraysSort.doubleSort 100 0.654 0.623 1.0
ArraysSort.doubleSort 1000 9.274 6.331 1.5
ArraysSort.doubleSort 10000 323.339 71.228 4.5
ArraysSort.doubleSort 100000 4471.871 1002.748 4.5
ArraysSort.doubleSort 1000000 51660.742 12921.295 4.0
ArraysSort.floatSort 10 0.045 0.046 1.0
ArraysSort.floatSort 25 0.103 0.084 1.2
ArraysSort.floatSort 50 0.285 0.33 0.9
ArraysSort.floatSort 75 0.492 0.346 1.4
ArraysSort.floatSort 100 0.597 0.326 1.8
ArraysSort.floatSort 1000 9.811 5.294 1.9
ArraysSort.floatSort 10000 323.955 50.547 6.4
ArraysSort.floatSort 100000 4326.38 731.152 5.9
ArraysSort.floatSort 1000000 52413.88 8409.193 6.2
ArraysSort.intSort 10 0.033 0.033 1.0
ArraysSort.intSort 25 0.086 0.051 1.7
ArraysSort.intSort 50 0.236 0.151 1.6
ArraysSort.intSort 75 0.416 0.332 1.3
ArraysSort.intSort 100 0.63 0.521 1.2
ArraysSort.intSort 1000 10.518 4.698 2.2
ArraysSort.intSort 10000 309.659 42.518 7.3
ArraysSort.intSort 100000 4130.917 573.956 7.2
ArraysSort.intSort 1000000 49876.307 6712.812 7.4
ArraysSort.longSort 10 0.036 0.037 1.0
ArraysSort.longSort 25 0.094 0.08 1.2
ArraysSort.longSort 50 0.218 0.227 1.0
ArraysSort.longSort 75 0.466 0.402 1.2
ArraysSort.longSort 100 0.76 0.58 1.3
ArraysSort.longSort 1000 10.449 6.239 1.7
ArraysSort.longSort 10000 307.074 70.284 4.4
ArraysSort.longSort 100000 4135.651 1049.12 3.9
ArraysSort.longSort 1000000 49559.152 12537.53 4.0
This is the first in the series of PRs related to vectorized sorting. Future planned steps after the integration of this PR:

Enabling AVX2 sort and partitioning for Linux ( based on Intel's x86-simd-sort PR).

Enabling SIMD sort (both AVX512 and AVX2) for Short and Char arrays for length < 1750. (Arrays larger than 1750 already use countingSort which is faster than AVX512 sort.)

Adding Windows support for both AVX512 and AVX2 using assembly stubs.

Enabling SIMD sort (both AVX512 and AVX2) for MemorySegment.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)

Change must not contain extraneous whitespace

Commit message must refer to an issue

Issue

JDK-8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) (Enhancement - P4)

Reviewers

Jatin Bhateja (@jatin-bhateja - Reviewer) ⚠️ Review applies to e63a2aa0

Sandhya Viswanathan (@sviswa7 - Reviewer) ⚠️ Review applies to 9642d852

Paul Sandoz (@PaulSandoz - Reviewer) ⚠️ Review applies to b04cb6c3

Vladimir Kozlov (@vnkozlov - Reviewer)

Reviewing

Using git
Using Skara CLI tools
Using diff file

Webrev

Link to Webrev Comment

hello, I tested this feature in openjdk 22.19, but it didn't seem to work.
my test code is as follows:

public class JDKSort {
    public static void main(String[] args) {
        new JDKSort().sort();
    }

    public void sort() {
        // 100 million pieces of data
        int size = 100000000;
        int[] arr = new int[size];
        java.util.Random rand = new java.util.Random();
        for (int i = 0; i < size; ++i) {
            arr[i] = rand.nextInt();
        }

        long startTime = System.currentTimeMillis();
        java.util.Arrays.sort(arr);
        long elapseTime = System.currentTimeMillis() - startTime;
        System.out.println("elapse time -> " + elapseTime + " ms");
    }
}

test result:

jdk8: 15079 ms
jdk22: 11619 ms

this feature looks like it's merged into 22.19:
x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays)
but it doesn't seem to working, do I need to do anything?

merykitty · 2023-10-13T08:17:58Z

@himichael Please refer to this question for how to correctly benchmark Java code.

himichael · 2023-10-13T10:31:14Z

@himichael Please refer to this question for how to correctly benchmark Java code.

thanks for your reply, but I think this has nothing to do with benchmark.
I sort a random array of int[], use: java.util.Arrays.sort()
in jdk 8, run command:
java JDKSort , the result display: 15079 ms

in open jdk 22.19 , run command:
java --add-modules jdk.incubator.vector JDKSort
the result display: 11619 ms

my question is that this feature should improve performance several times, but it doesn't look like there's much difference between open jdk 22.19 and jdk 8.
is there a problem with my configuration ?

sviswa7 · 2023-10-13T23:00:36Z

@himichael Please try java -Xbatch -XX:-TieredCompilation JDKSort. The method DualPivotQuickSort.sort() needs to be C2 compiled for you to see the benefit.

vamsi-parasa · 2023-10-13T23:59:55Z

my question is that this feature should improve performance several times, but it doesn't look like there's much difference between open jdk 22.19 and jdk 8. is there a problem with my configuration ?

Hello @himichael,

Using your code snippet, please see the output below using the latest JDK and JDK 20 (which does not have AVX512 sort):

JDK 20 (without AVX512 sort):
java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort

elapse time -> 7501 ms

JDK 22 (with AVX512 sort)
java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort
elapse time -> 1607 ms

It shows 4.66x speedup.

himichael · 2023-10-14T03:21:52Z

my question is that this feature should improve performance several times, but it doesn't look like there's much difference between open jdk 22.19 and jdk 8. is there a problem with my configuration ?

Hello @himichael,

Using your code snippet, please see the output below using the latest JDK and JDK 20 (which does not have AVX512 sort):

JDK 20 (without AVX512 sort): java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort

elapse time -> 7501 ms

JDK 22 (with AVX512 sort) java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort elapse time -> 1607 ms

It shows 4.66x speedup.

Hello, @vamsi-parasa
I used the commands you provided, but nothing seems to have changed.
The test procedure as follow:
use JDK 8(without AVX512 sort)

/data/soft/jdk1.8.0_371/bin/javac  JDKSort.java
/data/soft/jdk1.8.0_371/bin/java  JDKSort

elapse time -> 15309 ms

use OpenJDK 22.19(with AVX512 sort)

/data/soft/jdk-22/bin/javac JDKSort.java
/data/soft/jdk-22/bin/java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort
CompileCommand: CompileThresholdScaling java/util/DualPivotQuicksort.sort double CompileThresholdScaling = 0.000100

elapse time -> 11687 ms

Not much seems to have changed.

My JDK info:
OpenJDK 22.19:

/data/soft/jdk-22/bin/java -version
openjdk version "22-ea" 2024-03-19
OpenJDK Runtime Environment (build 22-ea+19-1460)
OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)

JDK 8:

/data/soft/jdk1.8.0_371/bin/java -version
java version "1.8.0_371"
Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)

I tested Intel's x86-simd-sort, my code as follow:

#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include "src/avx512-32bit-qsort.hpp"

int main() {

    // 100 million records
    const int size = 100000000;
    std::vector<int> random_array(size);

    for (int i = 0; i < size; ++i) {
        random_array[i] = rand();
    }

    auto start_time = std::chrono::steady_clock::now();

    avx512_qsort(random_array.data(), size);

    auto end_time = std::chrono::steady_clock::now();
    auto elapse_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

    std::cout << "elapse time -> " << elapse_time << " ms" << std::endl;
    return 0;
}

compile commands:

g++ -o sort -O3 -mavx512f -mavx512dq sort.cpp

elapse time -> 1151 ms
An order of magnitude performance improvement.

Here is my cpu information:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             8
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel Xeon Processor (Skylake, IBRS)
Stepping:              4
CPU MHz:               2394.374
BogoMIPS:              4788.74
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 md_clear spec_ctrl

lscpu | grep avx The following instructions are supported:

avx
avx2
avx512f
avx512dq
avx512cd
avx512bw
avx512vl

merykitty · 2023-10-14T05:02:52Z

@himichael What do you mean by this having nothing to do with benchmark. You are trying to execute some code to measure its execution time, which is benchmarking. And you are doing that on only 1 simple function, which makes your benchmark micro.

To be more specific, this is a C2-specific optimisation, so only C2-compiled code is benefitted from it. As a result, you need to have the function compiled BEFORE starting the clock. Typically, this is ensured by executing the function repeatedly for several iterations (the current default value is 20000), starting the clock, executing the function several more times, stopping the clock and calculating the average throughput. As this is quite complex and contains non-trivial caveats, it is recommended to use JMH for microbenchmarks.

vamsi-parasa · 2023-10-15T05:33:06Z

@himichael , could you check if the libsimdsort.so is being loaded by running the command below?
java -Xlog:library --version | grep libsimdsort

Here is my output:
[0.021s][info][library] Loaded library libsimdsort.so, handle 0x00007fa3c801ad80

himichael · 2023-10-15T07:40:06Z

@himichael , could you check if the libsimdsort.so is being loaded by running the command below? java -Xlog:library --version | grep libsimdsort

Here is my output: [0.021s][info][library] Loaded library libsimdsort.so, handle 0x00007fa3c801ad80

Hello, @vamsi-parasa,

I see why, I run command /data/soft/jdk-22/bin/java -Xlog:library -version
results as follows:

[0.071s][info][library] Loaded library libjsvml.so, handle 0x00007f3a5c0223e0
openjdk version "22-ea" 2024-03-19
OpenJDK Runtime Environment (build 22-ea+19-1460)
OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)

There is no information about libsimdsort.so

run /data/soft/jdk-22/bin/java -Xlog:library the results as follows:

[0.044s][info][library] Loaded library libjsvml.so, handle 0x00007f00b40223e0
[0.067s][info][library] Failed to find JNI_OnLoad_nio in library with handle 0x00007f016a4f6150
[0.068s][info][library] Loaded library /data/soft/jdk-22/lib/libnio.so, handle 0x00007f0160196da0
[0.068s][info][library] Found JNI_OnLoad in library with handle 0x00007f0160196da0
[0.069s][info][library] Found Java_sun_nio_fs_UnixNativeDispatcher_init in library with handle 0x00007f0160196da0
[0.069s][info][library] Found Java_sun_nio_fs_UnixNativeDispatcher_getcwd in library with handle 0x00007f0160196da0
[0.069s][info][library] Failed to find JNI_OnLoad_jimage in library with handle 0x00007f016a4f6150
[0.069s][info][library] Loaded library /data/soft/jdk-22/lib/libjimage.so, handle 0x00007f0160005450
[0.069s][info][library] Failed to find JNI_OnLoad in library with handle 0x00007f0160005450
[0.069s][info][library] Failed to find Java_jdk_internal_jimage_NativeImageBuffer_getNativeMap in library with handle 0x00007f0160196da0
[0.069s][info][library] Found Java_jdk_internal_jimage_NativeImageBuffer_getNativeMap in library with handle 0x00007f0160005450

It shows some bad failures, but it doesn't seem to have anything to do with libsimdsort.
I am not using a physical machine, I am using a virtual machine, this virtual machine supports the AVX512 instruction set.
How do I open libsimdsort ?

amittoranainc · 2023-10-24T17:45:34Z

can I use this function for sorting excel file data with multiple columns in it?

murphye · 2024-01-03T21:11:53Z

sing a physical machine, I am using a virtual machine, this virtual machine supports the AVX512 instruction set.
How do I open libsimdsort ?

@himichael Did you ever resolve your issue? I am using JDK22 from SDKMan and also do not have libsimdsort.so:

 java --add-modules jdk.incubator.vector  -Xlog:library   
[0.013s][info][library] Loaded library libjsvml.so, handle 0x00007fc9a40229b0
[0.024s][info][library] Failed to find JNI_OnLoad_nio in library with handle 0x00007fca6b2dc220
[0.024s][info][library] Loaded library /home/eric/.sdkman/candidates/java/22.ea.29-open/lib/libnio.so, handle 0x00007fca6419b9b0
[0.024s][info][library] Found JNI_OnLoad in library with handle 0x00007fca6419b9b0
[0.024s][info][library] Found Java_sun_nio_fs_UnixNativeDispatcher_init in library with handle 0x00007fca6419b9b0
[0.024s][info][library] Found Java_sun_nio_fs_UnixNativeDispatcher_getcwd in library with handle 0x00007fca6419b9b0
[0.025s][info][library] Failed to find JNI_OnLoad_jimage in library with handle 0x00007fca6b2dc220
[0.025s][info][library] Loaded library /home/eric/.sdkman/candidates/java/22.ea.29-open/lib/libjimage.so, handle 0x00007fca64006380
[0.025s][info][library] Failed to find JNI_OnLoad in library with handle 0x00007fca64006380
[0.025s][info][library] Failed to find Java_jdk_internal_jimage_NativeImageBuffer_getNativeMap in library with handle 0x00007fca6419b9b0
[0.025s][info][library] Found Java_jdk_internal_jimage_NativeImageBuffer_getNativeMap in library with handle 0x00007fca64006380

However, I checked my lib directory and libsimdsort.so does indeed exist there. Seems there is some command line arg to java missing or wrong.

tarsa · 2024-11-23T13:41:24Z

The answer for slow performance of AVX512 version of x86-simd-sort on Zen 4 is most probably explained in AMD manuals which could be found at: https://www.amd.com/en/search/documentation/hub.html#q=software%20optimization%20guide%20for%20the%20amd%20microarchitecture&f-amd_document_type=Software%20Optimization%20Guides

Software Optimization Guide for the AMD Zen4 Microarchitecture has following remark in "2.11.2 Code recommendations" chapter:

Avoid the memory destination form of COMPRESS instructions. These forms are implemented using microcode and achieve a lower store bandwidth than their register destination forms which use fastpath macro ops.

Software Optimization Guide for the AMD Zen5 Microarchitecture doesn't have any remark about COMPRESS instructions.

Could you add some code that disables the AVX512 version on Zen4, but keeps it enabled on Zen5 and future Zen architectures?

PaulSandoz · 2024-12-05T00:45:50Z

Could you add some code that disables the AVX512 version on Zen4, but keeps it enabled on Zen5 and future Zen architectures?

Or as you suggest here revert to AVX2. I updated JDK-8317976 with that suggestion, which is simpler to maintain. The HotSpot C++ class VM_Version might need updating to return the Zen version number.

8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long,…

e98e5ef

… float and double arrays)

openjdk bot added rfr Pull request is ready for review hotspot-compiler [email protected] labels May 30, 2023

vamsi-parasa added 2 commits May 30, 2023 12:54

remove libstdc++

923a7ca

Merge branch 'master' of https://git.openjdk.java.net/jdk into avx512…

6d140d5

…sort

xueyuanl mentioned this pull request May 31, 2023

Daily Hacker News 31-05-2023 xueyuanl/daily-hackernews#996

Open

github-actions bot mentioned this pull request Jun 1, 2023

2023-05-31 Hot Posts jiacai2050/mofish#300

Open

fix license

30a50d9

github-actions bot mentioned this pull request Jun 1, 2023

Hacker News Daily Top 30 @2023-06-01 meixger/hackernews-daily#256

Open

theRealAph reviewed Jun 1, 2023

View reviewed changes

test/micro/org/openjdk/bench/java/util/ArraysSort.java Outdated Show resolved Hide resolved

vamsi-parasa and others added 2 commits June 1, 2023 08:52

Update test/micro/org/openjdk/bench/java/util/ArraysSort.java

a7c2b6e

Co-authored-by: Andrew Haley <[email protected]>

fix license in one file

1dc9589

jatin-bhateja reviewed Jun 6, 2023

View reviewed changes

test/micro/org/openjdk/bench/java/util/ArraysSort.java Outdated Show resolved Hide resolved

jatin-bhateja reviewed Jun 6, 2023

View reviewed changes

test/micro/org/openjdk/bench/java/util/ArraysSort.java Outdated Show resolved Hide resolved

jatin-bhateja reviewed Jun 6, 2023

View reviewed changes

test/micro/org/openjdk/bench/java/util/ArraysSort.java Outdated Show resolved Hide resolved

jatin-bhateja reviewed Jun 6, 2023

View reviewed changes

src/java.base/share/classes/java/util/Arrays.java Outdated Show resolved Hide resolved

vamsi-parasa added 2 commits June 8, 2023 10:22

Merge branch 'openjdk:master' into avx512sort

3bd12ec

replace multiple intrinsics with one general intrinsic

53a5309

openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Jun 23, 2023

openjdk bot removed rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Oct 6, 2023

volkert-fastned mentioned this pull request Oct 9, 2023

Can x86-simd-sort be backported to Temurin OpenJDK 21 LTS? adoptium/adoptium#251

Closed

vamsi-parasa mentioned this pull request Oct 10, 2023

8317763: Follow-up to AVX512 intrinsics for Arrays.sort() PR #16124

Closed

3 tasks

iwanowww reviewed Oct 11, 2023

View reviewed changes

8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) #14227

8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) #14227

Conversation

vamsi-parasa commented May 30, 2023 • edited by openjdk bot Loading

Progress

Issue

Reviewers

Reviewing

Webrev

bridgekeeper bot commented May 30, 2023

vamsi-parasa commented May 30, 2023

openjdk bot commented May 30, 2023

mlbridge bot commented May 30, 2023 • edited Loading

Webrevs

theRealAph commented May 31, 2023

vamsi-parasa commented Jun 1, 2023 • edited Loading

merykitty commented Jun 1, 2023

franz1981 commented Jun 1, 2023

vamsi-parasa commented Jun 1, 2023

vamsi-parasa commented Jun 2, 2023

openjdk bot commented Jun 23, 2023

openjdk bot commented Oct 6, 2023

iaroslavski commented Oct 6, 2023 • edited Loading

Voultapher commented Oct 7, 2023 • edited by bridgekeeper bot Loading

iaroslavski commented Oct 7, 2023 • edited Loading

mlbridge bot commented Oct 7, 2023

iaroslavski commented Oct 7, 2023

DanielThomas commented Oct 8, 2023 • edited Loading

PaulSandoz commented Oct 9, 2023

PaulSandoz commented Oct 9, 2023

vamsi-parasa commented Oct 9, 2023

vamsi-parasa commented Oct 9, 2023

vamsi-parasa commented Oct 9, 2023 • edited Loading

iwanowww Oct 11, 2023

Choose a reason for hiding this comment

iwanowww Oct 11, 2023

Choose a reason for hiding this comment

sviswa7 Oct 11, 2023

Choose a reason for hiding this comment

iwanowww Oct 11, 2023

Choose a reason for hiding this comment

himichael commented Oct 13, 2023 • edited by bridgekeeper bot Loading

Progress

Issue

Reviewers

Reviewing

Webrev

merykitty commented Oct 13, 2023

himichael commented Oct 13, 2023

sviswa7 commented Oct 13, 2023

vamsi-parasa commented Oct 13, 2023

himichael commented Oct 14, 2023

merykitty commented Oct 14, 2023

vamsi-parasa commented Oct 15, 2023

himichael commented Oct 15, 2023

amittoranainc commented Oct 24, 2023 • edited by bridgekeeper bot Loading

murphye commented Jan 3, 2024 • edited Loading

tarsa commented Nov 23, 2024

PaulSandoz commented Dec 5, 2024

vamsi-parasa commented May 30, 2023 •

edited by openjdk bot

Loading

mlbridge bot commented May 30, 2023 •

edited

Loading

vamsi-parasa commented Jun 1, 2023 •

edited

Loading

iaroslavski commented Oct 6, 2023 •

edited

Loading

Voultapher commented Oct 7, 2023 •

edited by bridgekeeper bot

Loading

iaroslavski commented Oct 7, 2023 •

edited

Loading

DanielThomas commented Oct 8, 2023 •

edited

Loading

vamsi-parasa commented Oct 9, 2023 •

edited

Loading

himichael commented Oct 13, 2023 •

edited by bridgekeeper bot

Loading

amittoranainc commented Oct 24, 2023 •

edited by bridgekeeper bot

Loading

murphye commented Jan 3, 2024 •

edited

Loading