Skip to content

Releases: intel/x86-simd-sort

v6.0

04 Nov 16:49
9ab7d47
Compare
Choose a tag to compare

Release 6.0 introduces several key enhancements to our software. The update brings new APIs for partial sorting of key-value pairs, along with comprehensive AVX2 support across all routines. Additionally, PyTorch now leverages AVX-512 and AVX2 key-value sort routines to boost the performance of torch.sort and torch.argsort functions by up-to 10x (see pytorch/pytorch#127936). We thank the community for all their contributions. Here is a quick summary of all the changes:

New features:

  • Support for qselect and partial sort for key-value data types #151.
  • Accelerate key-value sort using openMP pragmas #153.
  • AVX2 support for key-value sort, partial sort and objsort methods #145, .
  • Simply use of ISA specific sort methods with a unified static API #143.
  • Support for descending order sort for all the sort routines #140, #144, #151.
  • Intel LLVM compiler support #131.

Improvements:

  • Improved pivot selection for better performance on data with few unique values #127, #168.

New Contributors

Full Changelog: v5.0...v6.0

v5.0

12 Feb 19:53
5b5884c
Compare
Choose a tag to compare

Release 5.0 adds new API's to support sorting arrays of custom defined objects and sorting key-value pairs of arrays. We also now have AVX2 acceleration for argsort and argselect methods. Here is a gist of the new features:

  • x86-simd-sort now supports an API to sort custom defined C++ objects via object_qsort. Please refer to README file on how to use it. Its performance can vary depending on the definition of the custom class. For the sake of illustration, sorting a simple struct based on one of its members can be up-to 4-5x faster than using std::sort on machines with AVX-512. Refer to the perf section for more details.
  • New API keyvalue_qsort to sort a pair of arrays representing key-value pairs (32-bit and 64-bit data types). These are an order of magnitude faster when compared to scalar ways of sorting them.
  • AVX2 support for argsort and argselect methods. These have been merged into NumPy and will be available with NumPy v2.0.

What's Changed

New Contributors

Full Changelog: v4.0...v5.0

v4.0

31 Oct 16:17
7559f70
Compare
Choose a tag to compare

v4.0 is a significant release with new features and improvements. AVX-512 sorting methods gain up to 2x perf improvements and we have added AVX2 sorting methods to support a wider range of x86 processors. In additional to using it as a header file library, x86-simd-sort can be installed as a library, and it provides API access to various sorting methods with automatic runtime dispatch to select the fastest version based on the processor. Here is a quick summary of all the changes:

  • Added AVX2 implementations of avx2_qsort, avx2_qselect and avx2_partial_qsort for 32-bit and 64-bit data types. When compared to std::sort, these are up to 12x faster for 32-bit data and up to 7x faster for 64-bit data.
  • x86-simd-sort can now be built and installed as a shared library. The library provides runtime dispatch and automatically picks the fastest version among AVX-512/AVX2/scalar depending on the processor it is run on. Starting with clearlinux v40270, you can install x86-simd-sort with swupd bundle-add x86-simd-sort.
  • Perf improvements to avx512_qsort: 2x speed up for 32-bit data, 1.5x speed up for 64-bit data and 1.25x speed up for 16-bit data.
  • Perf improvements to avx512_argsort and avx512_argselect intended to mitigate the effect of a vulnerability in gather instruction.

What's Changed

New Contributors

Full Changelog: v3.0...4.0.rc

v3.0

06 Oct 20:49
b9f9340
Compare
Choose a tag to compare

Version 3.0 release contains a new supported method avx512_argselect to compute arg nth_element (also known as argpartition in NumPy). It returns an array of indices that would partition the data array. Highlights of this release include:

  1. v3.0 x86-simd-sort is merged into NumPy main branch. It provides AVX-512 vectorized versions of np.partition and np.argpartition . It speeds up np.partition up by up to 25x for 16-bit, 17x for 32-bit dtypes and about 8x speed up for 64-bit dtypes. Speeds up for np.argpartition are up-to 6.5x.
  2. A slightly modified version of x86-simd-sort has now been merged into OpenJDK . It speeds up sorting 32-bit and 64-bit data by up to 15x and 7x respectively.

What's Changed

Full Changelog: v2.0...v3.0

v2.0

23 Jun 19:59
587af58
Compare
Choose a tag to compare

Release v2.0 adds a host of new sorting algorithms and supported API's. We also welcome new members to the community who contributed heavily to this release and we are deeply grateful for their efforts. We’d like to acknowledge Mitchell J. O'Sullivan @mosullivan93 and @ruclz /Xi Tang @xtangxtang for adding partial sorting methods and key-value sorting methods.. Here are the main highlights of this release:

New features:

  • AVX-512 based argsort algorithms using O(1) space for 32-bit and 64-bit data types [1]. It returns the indices that would sort an array. These are up to 6x faster when compared to a scalar solution that uses std::sort. This new feature is leveraged by NumPy in np.argsort [2] in its latest release v1.25.
  • AVX-512 based quick select algorithm [3] for 16-bit, 32-bit and 64-bit data types. These are equivalent to std::nthelement but performs a lot faster. The performance depends on the ratio K/N (where K is the index of the element the array is partitioned around and N is the array size). For smaller values of K, it is up to 6x faster for 64-bit data, up to 15x faster for 32-bit data and up to 7x faster for 16-bit data. The performance gets better as K gets larger (see #13 (comment) for more details).
  • AVX-512 based partial sort algorithm [3] for 16-bit, 32-bit and 64-bit data types. These are equivalent to std::partial_sort. As with quick select, its performance depends on the ratio K/N (where K is the size of partial sorted array and N is the array size) and it tends to perform an order of magnitude faster for larger values. It is about 1.05x faster for tiny partial array sort and up to 20x faster for slightly larger partial arrays. See this #49 (comment) for detailed performance numbers.
  • AVX-512 based key-value sort [4]. This sorts a pair of key-value arrays and is currently supported only for 64-bit data types. A version of this has been contributed to Oceanbase, an open source distributed relational database. See [8].
  • AVX-512 sort for _Float16 data type [5] using AVX-512 FP16 ISA. In NumPy, these are nearly 3x faster than AVX-512 based sort that emulates float16 data type [7] .

Improvements:

  • Improvements to AVX-512 quicksort [6] further speeds up sorting 64 and 32-bit data types by up to 1.6x and 1.3x respectively.

New Contributors

Pull request references:

[1] Add AVX-512 argsort for 32 and 64-bit data type by @r-devulap in #34
[2] numpy/numpy#23707
[3] Implement partial sorting algorithms by @mosullivan93 in #13
[4] add key-value file! by @ruclz in #2
[5] Add qsort for _Float16 using AVX-512 FP16 ISA by @r-devulap in #16
[6] Improve qsort by @r-devulap in #33
[7] numpy/numpy#23435
[8] oceanbase/oceanbase#1325

Full Changelog: v1.0...v2.0

v1.0

07 Mar 22:46
7d7591c
Compare
Choose a tag to compare

First release of x86-simd-sort. This is the version that was merged into NumPy main branch. See numpy/numpy#22315 for more details. Supported features:

  1. AVX-512 quicksort for the following dtypes: float16, uint16_t, int16_t, float, uint32_t, int32_t, double, uint64_t, int64_t.