
8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) #14227

Closed
wants to merge 45 commits

Conversation

vamsi-parasa
Contributor

@vamsi-parasa vamsi-parasa commented May 30, 2023

The goal is to develop faster sort routines for x86_64 CPUs by taking advantage of AVX512 instructions. This enhancement provides an order of magnitude speedup for Arrays.sort() using int, long, float and double arrays.

This PR shows up to ~7x improvement for 32-bit datatypes (int, float) and up to ~4.5x improvement for 64-bit datatypes (long, double), as shown in the performance data below.

Arrays.sort performance data using JMH benchmarks for arrays with random data

Arrays.sort benchmark Array Size Baseline (us/op) AVX512 Sort (us/op) Speedup
ArraysSort.doubleSort 10 0.034 0.035 1.0
ArraysSort.doubleSort 25 0.116 0.089 1.3
ArraysSort.doubleSort 50 0.282 0.291 1.0
ArraysSort.doubleSort 75 0.474 0.358 1.3
ArraysSort.doubleSort 100 0.654 0.623 1.0
ArraysSort.doubleSort 1000 9.274 6.331 1.5
ArraysSort.doubleSort 10000 323.339 71.228 4.5
ArraysSort.doubleSort 100000 4471.871 1002.748 4.5
ArraysSort.doubleSort 1000000 51660.742 12921.295 4.0
ArraysSort.floatSort 10 0.045 0.046 1.0
ArraysSort.floatSort 25 0.103 0.084 1.2
ArraysSort.floatSort 50 0.285 0.33 0.9
ArraysSort.floatSort 75 0.492 0.346 1.4
ArraysSort.floatSort 100 0.597 0.326 1.8
ArraysSort.floatSort 1000 9.811 5.294 1.9
ArraysSort.floatSort 10000 323.955 50.547 6.4
ArraysSort.floatSort 100000 4326.38 731.152 5.9
ArraysSort.floatSort 1000000 52413.88 8409.193 6.2
ArraysSort.intSort 10 0.033 0.033 1.0
ArraysSort.intSort 25 0.086 0.051 1.7
ArraysSort.intSort 50 0.236 0.151 1.6
ArraysSort.intSort 75 0.416 0.332 1.3
ArraysSort.intSort 100 0.63 0.521 1.2
ArraysSort.intSort 1000 10.518 4.698 2.2
ArraysSort.intSort 10000 309.659 42.518 7.3
ArraysSort.intSort 100000 4130.917 573.956 7.2
ArraysSort.intSort 1000000 49876.307 6712.812 7.4
ArraysSort.longSort 10 0.036 0.037 1.0
ArraysSort.longSort 25 0.094 0.08 1.2
ArraysSort.longSort 50 0.218 0.227 1.0
ArraysSort.longSort 75 0.466 0.402 1.2
ArraysSort.longSort 100 0.76 0.58 1.3
ArraysSort.longSort 1000 10.449 6.239 1.7
ArraysSort.longSort 10000 307.074 70.284 4.4
ArraysSort.longSort 100000 4135.651 1049.12 3.9
ArraysSort.longSort 1000000 49559.152 12537.53 4.0

This is the first in the series of PRs related to vectorized sorting. Future planned steps after the integration of this PR:

  1. Enabling AVX2 sort and partitioning for Linux (based on Intel's x86-simd-sort PR).
  2. Enabling SIMD sort (both AVX512 and AVX2) for Short and Char arrays for length < 1750. (Arrays larger than 1750 already use countingSort which is faster than AVX512 sort.)
  3. Adding Windows support for both AVX512 and AVX2 using assembly stubs.
  4. Enabling SIMD sort (both AVX512 and AVX2) for MemorySegment.

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8309130: x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays) (Enhancement - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/14227/head:pull/14227
$ git checkout pull/14227

Update a local copy of the PR:
$ git checkout pull/14227
$ git pull https://git.openjdk.org/jdk.git pull/14227/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 14227

View PR using the GUI difftool:
$ git pr show -t 14227

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/14227.diff

Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented May 30, 2023

👋 Welcome back vamsi-parasa! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@vamsi-parasa
Contributor Author

/label hotspot-compiler

@openjdk openjdk bot added rfr Pull request is ready for review hotspot-compiler [email protected] labels May 30, 2023
@openjdk

openjdk bot commented May 30, 2023

@vamsi-parasa
The hotspot-compiler label was successfully added.

@theRealAph
Contributor

What happens to really short arrays? Your patch should include macro benchmarks for e.g. 50 and 10.

@vamsi-parasa
Contributor Author

vamsi-parasa commented Jun 1, 2023

What happens to really short arrays? Your patch should include macro benchmarks for e.g. 50 and 10.

Thanks for the suggestion. Please see the performance for small array sizes below:

Arrays.sort benchmark Array Size Baseline (us/op) AVX512 Sort (us/op) Speedup
ArraysSort.intSort 10 0.029 0.018 1.6
ArraysSort.intSort 25 0.086 0.032 2.7
ArraysSort.intSort 50 0.236 0.056 4.2
ArraysSort.intSort 75 0.409 0.111 3.7
ArraysSort.longSort 10 0.031 0.033 0.9
ArraysSort.longSort 25 0.09 0.061 1.5
ArraysSort.longSort 50 0.228 0.127 1.8
ArraysSort.longSort 75 0.382 0.28 1.4
ArraysSort.doubleSort 10 0.037 0.043 0.9
ArraysSort.doubleSort 25 0.129 0.066 2.0
ArraysSort.doubleSort 50 0.267 0.115 2.3
ArraysSort.doubleSort 75 0.549 0.219 2.5
ArraysSort.floatSort 10 0.034 0.034 1.0
ArraysSort.floatSort 25 0.088 0.053 1.7
ArraysSort.floatSort 50 0.284 0.077 3.7
ArraysSort.floatSort 75 0.484 0.126 3.8

@merykitty
Member

I notice that

zmm_t ymm_vector<float>::max(zmm_t x, zmm_t y) {
    return _mm256_max_ps(x, y);
}

This is not quite right: Arrays.sort uses the total order imposed by Double.compare to sort the array, while _mm256_max_ps(x, y) computes x > y ? x : y, which is different.
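
For readers unfamiliar with the distinction, here is a small scalar illustration of the two semantics (plain Java, not the PR's vector code; the class name is illustrative):

public class TotalOrderVsIeeeMax {
    public static void main(String[] args) {
        float a = Float.NaN, b = 1.0f;
        // IEEE-style max, like _mm256_max_ps: NaN comparisons are false, so "a > b ? a : b" yields 1.0
        float ieeeMax = a > b ? a : b;
        // Total-order max per Float.compare: NaN sorts above every other value
        float totalOrderMax = Float.compare(a, b) > 0 ? a : b;
        System.out.println(ieeeMax);        // 1.0
        System.out.println(totalOrderMax);  // NaN
        // Signed zeros: == treats them as equal, but the total order places -0.0f before 0.0f
        System.out.println(-0.0f == 0.0f);              // true
        System.out.println(Float.compare(-0.0f, 0.0f)); // -1
    }
}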

@franz1981

Hi @vamsi-parasa !
Given https://bugs.openjdk.org/browse/JDK-8295496, I have noticed how important it is to add benchmark cases where the offset and length parameters change and/or differ from the usual 0 and the whole array length.
It is equally important to warm up with different combinations of them in order to "pollute" the JIT's existing decisions, making the compiled method (and stubs) appear more similar to what users would observe in a real-world scenario. Playing with the benchmark parameters like this, together with the advice of @theRealAph to try small inputs (which matters a lot), would reveal any performance difference with the current implementation.
In addition, I understand from https://github.com/openjdk/jdk/pull/14227/files#diff-1929ace9ae6df116e2fa2a718ed3924d9dae9a2daea454ca9a78177c21477aa3R5237 that this is not yet the case at this implementation stage, hence mine is a wish for the final round of this PR. 🙏
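
For illustration, a minimal JMH sketch along these lines; the class, parameter values and field names are hypothetical and are not the benchmarks shipped with this PR:

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class PartialArraySortBench {
    @Param({"1000", "100000"}) public int size;
    @Param({"0", "7", "133"})  public int offset;  // non-zero offsets exercise the (fromIndex, toIndex) paths

    public int[] data;

    @Setup(Level.Invocation)
    public void fill() {
        data = new Random(42).ints(size).toArray();
    }

    @Benchmark
    public int[] sortRange() {
        java.util.Arrays.sort(data, offset, size);  // sort only the range [offset, size)
        return data;
    }
}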

@vamsi-parasa
Contributor Author

I notice that

zmm_t ymm_vector<float>::max(zmm_t x, zmm_t y) {
    return _mm256_max_ps(x, y);
}

This is not quite right, Arrays.sort uses the total order imposed by Double.compare to sort the array, while _mm256_max_ps(x, y) does x > y ? x : y which is different.

Hi @merykitty
The algorithm is working for double as expected (i.e. implementing the total order). For example, for the input below:
double[] arrayUnsorted = {-0.0, Double.NaN, 15.75, Double.POSITIVE_INFINITY, -234.4869, Double.NEGATIVE_INFINITY, +0.0, 100.045};
It's showing the correct output after sorting as expected:
[-Infinity, -234.4869, -0.0, 0.0, 15.75, 100.045, Infinity, NaN]
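
For completeness, the example above wrapped in a runnable snippet (the class name is illustrative; the expected output is the one quoted in the comment):

import java.util.Arrays;

public class DoubleTotalOrderCheck {
    public static void main(String[] args) {
        double[] a = {-0.0, Double.NaN, 15.75, Double.POSITIVE_INFINITY,
                      -234.4869, Double.NEGATIVE_INFINITY, +0.0, 100.045};
        Arrays.sort(a);
        // Expected: [-Infinity, -234.4869, -0.0, 0.0, 15.75, 100.045, Infinity, NaN]
        System.out.println(Arrays.toString(a));
    }
}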

@vamsi-parasa
Contributor Author

Hi @vamsi-parasa ! Given https://bugs.openjdk.org/browse/JDK-8295496 I have noticed how much important is to add benchmark cases where offset and length parameters change and/or differ from the usual 0 and the whole array length. Equally important is to warmup with different combinations of them in order to "pollute" the JIT existing decisions, making the compiled method (and stubs) to appear more similar to what users would observe in a real world scenario. Playing with the benchmark parameters like this, together with the advice of @theRealAph to try with small inputs (that matters a lot) would unveil any perf difference with the current impl. In addition, I understand by https://github.com/openjdk/jdk/pull/14227/files#diff-1929ace9ae6df116e2fa2a718ed3924d9dae9a2daea454ca9a78177c21477aa3R5237 that's still not the case for such, at this implementation stage, hence mine is a wish for the final round impl for this PR. 🙏

Hi @franz1981, thank you for the suggestions! The algorithm was tested to sort only a part of the array with non-zero offsets and length. I will upstream those benchmarks/tests as well.

@openjdk

openjdk bot commented Jun 23, 2023

@vamsi-parasa this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout avx512sort
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Jun 23, 2023
@openjdk openjdk bot removed rfr Pull request is ready for review sponsor Pull request is ready to be sponsored labels Oct 6, 2023
@openjdk

openjdk bot commented Oct 6, 2023

@sviswa7 @vamsi-parasa Pushed as commit a4e9168.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@iaroslavski
Contributor

iaroslavski commented Oct 6, 2023

Hi @vamsi-parasa,

Maybe too late, but there is one question. We have 2 new methods
private static <A> void sort(Class<?> elemType, A array, ...
private static <A> int[] partition(Class<?> elemType, A array, ...
and 4 methods which are passed to them:
mixedInsertionSort, insertionSort, partitionDualPivot, partitionSinglePivot

Methods partitionDualPivot and partitionSinglePivot are @ForceInline,
but mixedInsertionSort and insertionSort are not. Is it ok or did we miss something?

@Voultapher

Voultapher commented Oct 7, 2023

When the numpy news came out I analyzed the Intel implementation here: https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md. I think it's relevant here as well. I assume the new version of x86-simd-sort contains improvements; I hope they have addressed the easy-to-hit quadratic runtime.

@iaroslavski
Contributor

iaroslavski commented Oct 7, 2023

Hi @vamsi-parasa,

If DualPivotQuicksort.java is updated, can you improve the class a little bit?

Please find my suggested changes:

  1. < removed >

  2. The javadoc of int[] partition(A a, int low, ...) says 'to be sorted' for the parameters a, low and high.
    It should say 'to be partitioned'. Please correct.

  3. The same for the method private static <A> int[] partition(Class<?> elemType, ...): please update its javadoc.

  4. < removed >

  5. And the change from my previous comment: either all 4 methods should have @ForceInline, or all of them should be without
    @ForceInline. Am I right?

I guess a new PR should be opened for all the mentioned changes. Could you please go ahead?
I would appreciate your help in making sorting more accurate and better.

@mlbridge

mlbridge bot commented Oct 7, 2023

Mailing list message from Florian Weimer on hotspot-runtime-dev:

I believe this has introduced a build failure with GCC 12.2 on Debian 12.1:

Building target 'jdk' in configuration '/home/fw/build/jdk'
In file included from /usr/lib/gcc/x86_64-linux-gnu/12/include/immintrin.h:49,
from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-common-qsort.h:60,
from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:31,
from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-linux-qsort.cpp:26:
In function '__m512i _mm512_shuffle_epi32(__m512i, _MM_PERM_ENUM)',
inlined from 'static zmm_vector<int>::zmm_t zmm_vector<int>::shuffle(zmm_t) [with unsigned char mask = 177]' at ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:96:36,
inlined from 'zmm_t sort_zmm_32bit(zmm_t) [with vtype = zmm_vector<int>; zmm_t = __vector(8) long long int]' at ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:170:27:
/usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h:4459:50: error: '__Y' is used uninitialized [-Werror=uninitialized]
4459 | return (__m512i) __builtin_ia32_pshufd512_mask ((__v16si) __A,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
4460 | __mask,
| ~~~~~~~
4461 | (__v16si)
| ~~~~~~~~~
4462 | _mm512_undefined_epi32 (),
| ~~~~~~~~~~~~~~~~~~~~~~~~~~
4463 | (__mmask16) -1);
| ~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h: In function 'zmm_t sort_zmm_32bit(zmm_t) [with vtype = zmm_vector<int>; zmm_t = __vector(8) long long int]':
/usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h:206:11: note: '__Y' was declared here
206 | __m512i __Y = __Y;
| ^~~

@iaroslavski
Contributor

@vamsi-parasa
Please disregard my request to change high -> end. I found a way to update the Java code only.
I updated my previous comment. Only items 2, 3 and 5 should be performed.

@DanielThomas

DanielThomas commented Oct 8, 2023

A discussion on Reddit raised that this had the potential to regress sort performance on AMD Zen 4. The poster didn't have access to Zen 4 hardware, so I took a moment to run the benchmark, comparing against a baseline with UseAVX=2 on an AWS m7a.2xlarge:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 17
model name	: AMD EPYC 9R14
stepping	: 1
microcode	: 0xa10113e
cpu MHz		: 3673.537
cache size	: 1024 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr rdpru wbnoinvd arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid flush_l1d
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 5200.00
TLB size	: 3584 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:
$ make test TEST="micro:java.lang.ArraysSort.intSort"

Benchmark            (size)  Mode  Cnt      Score     Error  Units
ArraysSort.intSort       10  avgt    3      0.033 ±   0.001  us/op
ArraysSort.intSort       25  avgt    3      0.067 ±   0.002  us/op
ArraysSort.intSort       50  avgt    3      0.391 ±   0.007  us/op
ArraysSort.intSort       75  avgt    3      0.923 ±   0.041  us/op
ArraysSort.intSort      100  avgt    3      1.292 ±   0.009  us/op
ArraysSort.intSort     1000  avgt    3     24.268 ±   0.078  us/op
ArraysSort.intSort    10000  avgt    3    320.131 ±   0.381  us/op
ArraysSort.intSort   100000  avgt    3   4803.142 ±  63.901  us/op
ArraysSort.intSort  1000000  avgt    3  61457.708 ± 347.446  us/op
$ make test TEST="micro:java.lang.ArraysSort.intSort" MICRO="VM_OPTIONS=-XX:UseAVX=2"

Benchmark            (size)  Mode  Cnt      Score      Error  Units
ArraysSort.intSort       10  avgt    3      0.034 ±    0.001  us/op
ArraysSort.intSort       25  avgt    3      0.091 ±    0.008  us/op
ArraysSort.intSort       50  avgt    3      0.247 ±    0.022  us/op
ArraysSort.intSort       75  avgt    3      0.432 ±    0.005  us/op
ArraysSort.intSort      100  avgt    3      0.560 ±    0.018  us/op
ArraysSort.intSort     1000  avgt    3      8.848 ±    0.530  us/op
ArraysSort.intSort    10000  avgt    3    118.476 ±  478.914  us/op
ArraysSort.intSort   100000  avgt    3   4656.937 ±  110.323  us/op
ArraysSort.intSort  1000000  avgt    3  56598.595 ± 2749.859  us/op

This is explained in the linked comments/commits and this Twitter thread:

@PaulSandoz
Member

@DanielThomas thank you, I was not aware of the limitation on AMD Zen 4. We need to address this, and it appears the fix is simple and localized. Thankfully we are early in the JDK 22 development cycle, so we have time to shake this out.

@PaulSandoz
Member

Mailing list message from Florian Weimer on hotspot-runtime-dev:

I believe this has introduced a build failure with GCC 12.2 on Debian 12.1:

I believe it is due to an issue in GCC 12 that has yet to be resolved:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593

We (Oracle) currently build with GCC 11.2.0 (https://wiki.openjdk.org/display/Build/Supported+Build+Platforms), so I expect we will encounter the same problem when upgrading to the 12 toolchain.

@vamsi-parasa
Contributor Author

Mailing list message from Florian Weimer on hotspot-runtime-dev:

I believe this has introduced a build failure with GCC 12.2 on Debian 12.1:

Building target 'jdk' in configuration '/home/fw/build/jdk' In file included from /usr/lib/gcc/x86_64-linux-gnu/12/include/immintrin.h:49, from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-common-qsort.h:60, from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:31, from ?/jdk/src/java.base/linux/native/libsimdsort/avx512-linux-qsort.cpp:26: In function '__m512i _mm512_shuffle_epi32(__m512i, _MM_PERM_ENUM)', inlined from 'static zmm_vector::zmm_t zmm_vector::shuffle(zmm_t) [with unsigned char mask = 177]' at ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:96:36, inlined from 'zmm_t sort_zmm_32bit(zmm_t) [with vtype = zmm_vector; zmm_t = __vector(8) long long int]' at ?/jdk/src/java.base/linux/native/libsimdsort/avx512-32bit-qsort.hpp:170:27: /usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h:4459:50: error: '__Y' is used uninitialized [-Werror=uninitialized] 4459 | return (__m512i) __builtin_ia32_pshufd512_mask ((__v16si) __A, | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~ 4460 | __mask, | ~~~~~~~ 4461 | (__v16si) | ~~~~~~~~~ 4462 | _mm512_undefined_epi32 (), | ~~~~~~~~~~~~~~~~~~~~~~~~~~ 4463 | (__mmask16) -1); | ~~~~~~~~~~~~~~~ /usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h: In function 'zmm_t sort_zmm_32bit(zmm_t) [with vtype = zmm_vector; zmm_t = __vector(8) long long int]': /usr/lib/gcc/x86_64-linux-gnu/12/include/avx512fintrin.h:206:11: note: '__Y' was declared here 206 | __m512i __Y = __Y; | ^~~

Hello Florian (@fweimer),

As Paul mentioned, this is an issue in GCC 12.
Will reproduce it on my side and issue a workaround PR shortly to address this issue.

Thanks,
Vamsi

@vamsi-parasa
Contributor Author

A discussion on Reddit raised that this had the potential to regress sort performance on AMD Zen 4. The poster didn't have access to Zen 4 hardware, so I took a moment to run the benchmark, comparing against a baseline with UseAVX=2 on an AWS m7a.2xlarge:

Hello Daniel (@DanielThomas ),

Thank you for sharing the data. Will issue a new PR to restrict the AVX512 sort to Intel CPUs only for the time being.

Thanks,
Vamsi

@vamsi-parasa
Contributor Author

vamsi-parasa commented Oct 9, 2023

Mailing list message from Florian Weimer on hotspot-runtime-dev:
I believe this has introduced a build failure with GCC 12.2 on Debian 12.1:

I believe it due to an issue in GCC 12 that has yet to be resolved:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105593

We (Oracle) currently build with GCC 11.2.0 (https://wiki.openjdk.org/display/Build/Supported+Build+Platforms) so i expect we will encounter the same problem when upgrading to the 12 toolchain.

Hello Paul (@PaulSandoz) and Florian (@fweimer),

Just did a successful OpenJDK build (without any of the errors reported) using the latest GCC 12 (12.3.1). Also reproduced the reported build failure using GCC 12.2.0.
It looks like this issue is fixed in GCC 12.3.1.

Thanks,
Vamsi

@IntrinsicCandidate
@ForceInline
private static <A> void sort(Class<?> elemType, A array, long offset, int low, int high, SortOperation<A> so) {
so.sort(array, low, high);
Contributor

I'm late to the party, but how does the fallback implementation work (or how is it intended to) for the off-heap case (null + absolute base address)?

Contributor

Also, for the on-heap case the fallback implementation is equivalent to the intrinsified case only when the offset points at the 0th element of the array.


@iwanowww Yes, you are late to the party :). The fallback implementation could be similar to vectorizedMismatch regarding base/offset for the non-heap case. Please see java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java.
Yes, the fallback implementation for the non-intrinsic case is kept equivalent to what existed before the SIMD sort PR. This is done so as not to affect fallback performance on other platforms.

Contributor

The problem with the fallback implementation as it is now is that it doesn't take the offset into account. It's completely broken for the off-heap case, and it works for the on-heap case only because the implementation always passes the primitive array base offset.
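
To make the concern concrete, here is a hypothetical sketch (not JDK code; the names and the base-offset constant are made up) of what an offset-aware on-heap fallback would have to do instead of dropping the byte offset:

// The intrinsified path receives (array, byteOffset, low, high); the current Java
// fallback calls so.sort(array, low, high) and ignores byteOffset, so it only matches
// the intrinsic when byteOffset is exactly the primitive array base offset.
interface IntRangeSorter { void sort(int[] a, int low, int high); }

final class OffsetAwareFallback {
    // Illustrative constant; real code would use Unsafe.ARRAY_INT_BASE_OFFSET.
    static final long INT_BASE_OFFSET = 16L;

    static void sort(int[] array, long byteOffset, int low, int high, IntRangeSorter sorter) {
        // Fold the elements hidden in the byte offset back into the index range.
        int shift = (int) ((byteOffset - INT_BASE_OFFSET) >> 2);
        sorter.sort(array, low + shift, high + shift);
    }
}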

@himichael

himichael commented Oct 13, 2023

(Quoted the full PR description and benchmark table from the top of this page.)

Hello, I tested this feature in OpenJDK 22 (build 22-ea+19), but it didn't seem to work.
My test code is as follows:

public class JDKSort {
    public static void main(String[] args) {
        new JDKSort().sort();
    }

    public void sort() {
        // 100 million pieces of data
        int size = 100000000;
        int[] arr = new int[size];
        java.util.Random rand = new java.util.Random();
        for (int i = 0; i < size; ++i) {
            arr[i] = rand.nextInt();
        }

        long startTime = System.currentTimeMillis();
        java.util.Arrays.sort(arr);
        long elapseTime = System.currentTimeMillis() - startTime;
        System.out.println("elapse time -> " + elapseTime + " ms");
    }
}

test result:

  • jdk8: 15079 ms
  • jdk22: 11619 ms

this feature looks like it's merged into 22.19:
x86_64 AVX512 intrinsics for Arrays.sort methods (int, long, float and double arrays)
but it doesn't seem to be working; do I need to do anything?

@merykitty
Member

@himichael Please refer to this question for how to correctly benchmark Java code.

@himichael

@himichael Please refer to this question for how to correctly benchmark Java code.

Thanks for your reply, but I think this has nothing to do with benchmarking.
I sort a random int[] array using java.util.Arrays.sort().
In JDK 8, I run the command:
java JDKSort, and the result shows: 15079 ms

In OpenJDK 22 (22-ea+19), I run the command:
java --add-modules jdk.incubator.vector JDKSort
and the result shows: 11619 ms

My question is that this feature should improve performance several times, but there doesn't seem to be much difference between OpenJDK 22 and JDK 8.
Is there a problem with my configuration?

@sviswa7

sviswa7 commented Oct 13, 2023

@himichael Please try java -Xbatch -XX:-TieredCompilation JDKSort. The method DualPivotQuicksort.sort() needs to be C2 compiled for you to see the benefit.

@vamsi-parasa
Contributor Author

my question is that this feature should improve performance several times, but it doesn't look like there's much difference between open jdk 22.19 and jdk 8. is there a problem with my configuration ?

Hello @himichael,

Using your code snippet, please see the output below using the latest JDK and JDK 20 (which does not have AVX512 sort):

JDK 20 (without AVX512 sort):
java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort

elapse time -> 7501 ms


JDK 22 (with AVX512 sort)
java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort
elapse time -> 1607 ms

It shows 4.66x speedup.

@himichael

my question is that this feature should improve performance several times, but it doesn't look like there's much difference between open jdk 22.19 and jdk 8. is there a problem with my configuration ?

Hello @himichael,

Using your code snippet, please see the output below using the latest JDK and JDK 20 (which does not have AVX512 sort):

JDK 20 (without AVX512 sort): java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort

elapse time -> 7501 ms

JDK 22 (with AVX512 sort) java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort elapse time -> 1607 ms

It shows 4.66x speedup.

Hello, @vamsi-parasa
I used the commands you provided, but nothing seems to have changed.
The test procedure is as follows:
Using JDK 8 (without AVX512 sort):

/data/soft/jdk1.8.0_371/bin/javac  JDKSort.java
/data/soft/jdk1.8.0_371/bin/java  JDKSort

elapse time -> 15309 ms

Using OpenJDK 22 (22-ea+19, with AVX512 sort):

/data/soft/jdk-22/bin/javac JDKSort.java
/data/soft/jdk-22/bin/java -XX:CompileCommand=CompileThresholdScaling,java.util.DualPivotQuicksort::sort,0.0001 -XX:-TieredCompilation JDKSort
CompileCommand: CompileThresholdScaling java/util/DualPivotQuicksort.sort double CompileThresholdScaling = 0.000100

elapse time -> 11687 ms

Not much seems to have changed.

My JDK info:
OpenJDK 22.19:

/data/soft/jdk-22/bin/java -version
openjdk version "22-ea" 2024-03-19
OpenJDK Runtime Environment (build 22-ea+19-1460)
OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)

JDK 8:

/data/soft/jdk1.8.0_371/bin/java -version
java version "1.8.0_371"
Java(TM) SE Runtime Environment (build 1.8.0_371-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)

I tested Intel's x86-simd-sort; my code is as follows:

#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include "src/avx512-32bit-qsort.hpp"

int main() {

    // 100 million records
    const int size = 100000000;
    std::vector<int> random_array(size);

    for (int i = 0; i < size; ++i) {
        random_array[i] = rand();
    }

    auto start_time = std::chrono::steady_clock::now();

    avx512_qsort(random_array.data(), size);

    auto end_time = std::chrono::steady_clock::now();
    auto elapse_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time).count();

    std::cout << "elapse time -> " << elapse_time << " ms" << std::endl;
    return 0;
}

compile commands:

g++ -o sort -O3 -mavx512f -mavx512dq sort.cpp

elapse time -> 1151 ms
An order of magnitude performance improvement.

Here is my cpu information:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             8
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel Xeon Processor (Skylake, IBRS)
Stepping:              4
CPU MHz:               2394.374
BogoMIPS:              4788.74
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 md_clear spec_ctrl

lscpu | grep avx shows that the following AVX instruction sets are supported:

  • avx
  • avx2
  • avx512f
  • avx512dq
  • avx512cd
  • avx512bw
  • avx512vl

@merykitty
Member

@himichael What do you mean by this having nothing to do with benchmarking? You are trying to execute some code to measure its execution time, which is benchmarking. And you are doing that on only one simple function, which makes your benchmark a microbenchmark.

To be more specific, this is a C2-specific optimisation, so only C2-compiled code benefits from it. As a result, you need to have the function compiled BEFORE starting the clock. Typically, this is ensured by executing the function repeatedly for several iterations (the current default value is 20000), starting the clock, executing the function several more times, stopping the clock and calculating the average throughput. As this is quite complex and contains non-trivial caveats, it is recommended to use JMH for microbenchmarks.
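
As a rough illustration of that advice (JMH remains the recommended tool; the class below is hypothetical), the measurement only becomes meaningful once the sort has already been run enough times to be JIT-compiled:

import java.util.Arrays;
import java.util.Random;

public class WarmedUpSortTiming {
    public static void main(String[] args) {
        int[] template = new Random(42).ints(1_000_000).toArray();

        // Warmup: run the sort repeatedly so the hot paths get compiled before timing.
        for (int i = 0; i < 50; i++) {
            Arrays.sort(template.clone());
        }

        // Measurement: average over several runs on fresh copies of the same data.
        int runs = 10;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            Arrays.sort(template.clone());
        }
        System.out.println("avg sort time: " + (System.nanoTime() - start) / runs / 1_000_000 + " ms");
    }
}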

@vamsi-parasa
Contributor Author

@himichael , could you check if the libsimdsort.so is being loaded by running the command below?
java -Xlog:library --version | grep libsimdsort

Here is my output:
[0.021s][info][library] Loaded library libsimdsort.so, handle 0x00007fa3c801ad80

@himichael

@himichael , could you check if the libsimdsort.so is being loaded by running the command below? java -Xlog:library --version | grep libsimdsort

Here is my output: [0.021s][info][library] Loaded library libsimdsort.so, handle 0x00007fa3c801ad80

Hello, @vamsi-parasa,

I see why. I ran the command /data/soft/jdk-22/bin/java -Xlog:library -version
and the results are as follows:

[0.071s][info][library] Loaded library libjsvml.so, handle 0x00007f3a5c0223e0
openjdk version "22-ea" 2024-03-19
OpenJDK Runtime Environment (build 22-ea+19-1460)
OpenJDK 64-Bit Server VM (build 22-ea+19-1460, mixed mode, sharing)

There is no information about libsimdsort.so

Running /data/soft/jdk-22/bin/java -Xlog:library gives the following results:

[0.044s][info][library] Loaded library libjsvml.so, handle 0x00007f00b40223e0
[0.067s][info][library] Failed to find JNI_OnLoad_nio in library with handle 0x00007f016a4f6150
[0.068s][info][library] Loaded library /data/soft/jdk-22/lib/libnio.so, handle 0x00007f0160196da0
[0.068s][info][library] Found JNI_OnLoad in library with handle 0x00007f0160196da0
[0.069s][info][library] Found Java_sun_nio_fs_UnixNativeDispatcher_init in library with handle 0x00007f0160196da0
[0.069s][info][library] Found Java_sun_nio_fs_UnixNativeDispatcher_getcwd in library with handle 0x00007f0160196da0
[0.069s][info][library] Failed to find JNI_OnLoad_jimage in library with handle 0x00007f016a4f6150
[0.069s][info][library] Loaded library /data/soft/jdk-22/lib/libjimage.so, handle 0x00007f0160005450
[0.069s][info][library] Failed to find JNI_OnLoad in library with handle 0x00007f0160005450
[0.069s][info][library] Failed to find Java_jdk_internal_jimage_NativeImageBuffer_getNativeMap in library with handle 0x00007f0160196da0
[0.069s][info][library] Found Java_jdk_internal_jimage_NativeImageBuffer_getNativeMap in library with handle 0x00007f0160005450

It shows some failures, but they don't seem to have anything to do with libsimdsort.
I am not using a physical machine, I am using a virtual machine; this virtual machine supports the AVX512 instruction set.
How do I enable libsimdsort?

@amittoranainc

amittoranainc commented Oct 24, 2023

Can I use this function for sorting Excel file data with multiple columns in it?

@murphye

murphye commented Jan 3, 2024

I am not using a physical machine, I am using a virtual machine; this virtual machine supports the AVX512 instruction set.
How do I enable libsimdsort?

@himichael Did you ever resolve your issue? I am using JDK22 from SDKMan and also do not have libsimdsort.so:

 java --add-modules jdk.incubator.vector  -Xlog:library   
[0.013s][info][library] Loaded library libjsvml.so, handle 0x00007fc9a40229b0
[0.024s][info][library] Failed to find JNI_OnLoad_nio in library with handle 0x00007fca6b2dc220
[0.024s][info][library] Loaded library /home/eric/.sdkman/candidates/java/22.ea.29-open/lib/libnio.so, handle 0x00007fca6419b9b0
[0.024s][info][library] Found JNI_OnLoad in library with handle 0x00007fca6419b9b0
[0.024s][info][library] Found Java_sun_nio_fs_UnixNativeDispatcher_init in library with handle 0x00007fca6419b9b0
[0.024s][info][library] Found Java_sun_nio_fs_UnixNativeDispatcher_getcwd in library with handle 0x00007fca6419b9b0
[0.025s][info][library] Failed to find JNI_OnLoad_jimage in library with handle 0x00007fca6b2dc220
[0.025s][info][library] Loaded library /home/eric/.sdkman/candidates/java/22.ea.29-open/lib/libjimage.so, handle 0x00007fca64006380
[0.025s][info][library] Failed to find JNI_OnLoad in library with handle 0x00007fca64006380
[0.025s][info][library] Failed to find Java_jdk_internal_jimage_NativeImageBuffer_getNativeMap in library with handle 0x00007fca6419b9b0
[0.025s][info][library] Found Java_jdk_internal_jimage_NativeImageBuffer_getNativeMap in library with handle 0x00007fca64006380

However, I checked my lib directory and libsimdsort.so does indeed exist there. It seems some command-line argument to java is missing or wrong.

@tarsa

tarsa commented Nov 23, 2024

The answer to the slow performance of the AVX512 version of x86-simd-sort on Zen 4 is most probably explained in the AMD manuals, which can be found at: https://www.amd.com/en/search/documentation/hub.html#q=software%20optimization%20guide%20for%20the%20amd%20microarchitecture&f-amd_document_type=Software%20Optimization%20Guides

The Software Optimization Guide for the AMD Zen 4 Microarchitecture has the following remark in the "2.11.2 Code recommendations" chapter:

Avoid the memory destination form of COMPRESS instructions. These forms are implemented using microcode and achieve a lower store bandwidth than their register destination forms which use fastpath macro ops.

The Software Optimization Guide for the AMD Zen 5 Microarchitecture doesn't have any such remark about COMPRESS instructions.

Could you add some code that disables the AVX512 version on Zen4, but keeps it enabled on Zen5 and future Zen architectures?

@PaulSandoz
Member

Could you add some code that disables the AVX512 version on Zen4, but keeps it enabled on Zen5 and future Zen architectures?

Or, as you suggest here, revert to AVX2. I updated JDK-8317976 with that suggestion, which is simpler to maintain. The HotSpot C++ class VM_Version might need updating to return the Zen version number.
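
As a sketch of the gating idea (hypothetical illustration only, not the HotSpot VM_Version code; the Zen 5 family value is an assumption), the check could key off the AMD family reported by CPUID, which is 0x19 (25) for the Zen 4 part shown earlier in this thread:

// Hypothetical illustration only. Real gating would live in HotSpot's C++ VM_Version.
final class SimdSortGate {
    static boolean useAvx512Sort(boolean isAmd, int cpuFamily, boolean hasAvx512Dq) {
        if (!hasAvx512Dq) return false;
        // Zen 4 and older (family <= 0x19): compress-store forms are microcoded, prefer the AVX2 path.
        if (isAmd && cpuFamily <= 0x19) return false;
        return true;
    }
}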
