Releases: intel/x86-simd-sort
v6.0
Release 6.0 introduces several key enhancements to our software. The update brings new APIs for partial sorting of key-value pairs, along with comprehensive AVX2 support across all routines. Additionally, PyTorch now leverages AVX-512 and AVX2 key-value sort routines to boost the performance of torch.sort
and torch.argsort
functions by up-to 10x (see pytorch/pytorch#127936). We thank the community for all their contributions. Here is a quick summary of all the changes:
New features:
- Support for qselect and partial sort for key-value data types #151.
- Accelerate key-value sort using openMP pragmas #153.
- AVX2 support for key-value sort, partial sort and objsort methods #145, .
- Simply use of ISA specific sort methods with a unified static API #143.
- Support for descending order sort for all the sort routines #140, #144, #151.
- Intel LLVM compiler support #131.
Improvements:
New Contributors
- @cmsxbc made their first contribution in #155
- @ron-at-swgy made their first contribution in #157
- @yixuan made their first contribution in #159
- @WilliamTambellini made their first contribution in #164
- @yinqiwen made their first contribution in #165
Full Changelog: v5.0...v6.0
v5.0
Release 5.0 adds new API's to support sorting arrays of custom defined objects and sorting key-value pairs of arrays. We also now have AVX2 acceleration for argsort
and argselect
methods. Here is a gist of the new features:
- x86-simd-sort now supports an API to sort custom defined C++ objects via
object_qsort
. Please refer to README file on how to use it. Its performance can vary depending on the definition of the custom class. For the sake of illustration, sorting a simplestruct
based on one of its members can be up-to 4-5x faster than usingstd::sort
on machines with AVX-512. Refer to the perf section for more details. - New API
keyvalue_qsort
to sort a pair of arrays representing key-value pairs (32-bit and 64-bit data types). These are an order of magnitude faster when compared to scalar ways of sorting them. - AVX2 support for
argsort
andargselect
methods. These have been merged into NumPy and will be available with NumPy v2.0.
What's Changed
- README.md: fix broken link by @rouault in #98
- Improve emulation of AVX2 min/max 64-bit by @r-devulap in #99
- fix numpy CI failures by @r-devulap in #100
- Add key-value sort to runtime dispatch by @r-devulap in #105
- Avoid masks when possible in AVX2 logic by @sterrettm2 in #104
- Add more benchmarks by @r-devulap in #106
- Support key-value sort for 32-bit dtypes by @r-devulap in #108
- Add API to sort array of custom objects by @r-devulap in #103
- Mark explicit template specializations inline by @r-devulap in #112
- BUG: bug fix in avx512_qsort_fp16 by @r-devulap in #113
- CI: build and test on 32-bit linux by @r-devulap in #114
- Support for AVX2 argsort/argselect/key-value sort by @sterrettm2 in #110
- Changes argsort/argselect to use generic networks by @sterrettm2 in #102
- Improve key-value sort performance by @r-devulap in #120
- Add IPP sort to benchmarks by @r-devulap in #121
- Build fix on macOS 64-bit by @r-devulap in #122
- Fix more build issues on macOS by @r-devulap in #123
- Add avx2_vector defintion for size_t on macOS by @r-devulap in #124
- [fix] update link in README.md by @icfaust in #125
- Use uint32_t instead of size_t for object sort by @r-devulap in #126
- Add simple test for objsort by @r-devulap in #128
- Update README with object_qsort by @r-devulap in #130
New Contributors
Full Changelog: v4.0...v5.0
v4.0
v4.0 is a significant release with new features and improvements. AVX-512 sorting methods gain up to 2x perf improvements and we have added AVX2 sorting methods to support a wider range of x86 processors. In additional to using it as a header file library, x86-simd-sort can be installed as a library, and it provides API access to various sorting methods with automatic runtime dispatch to select the fastest version based on the processor. Here is a quick summary of all the changes:
- Added AVX2 implementations of
avx2_qsort
,avx2_qselect
andavx2_partial_qsort
for 32-bit and 64-bit data types. When compared tostd::sort
, these are up to 12x faster for 32-bit data and up to 7x faster for 64-bit data. - x86-simd-sort can now be built and installed as a shared library. The library provides runtime dispatch and automatically picks the fastest version among AVX-512/AVX2/scalar depending on the processor it is run on. Starting with
clearlinux
v40270, you can install x86-simd-sort withswupd bundle-add x86-simd-sort
. - Perf improvements to
avx512_qsort
: 2x speed up for 32-bit data, 1.5x speed up for 64-bit data and 1.25x speed up for 16-bit data. - Perf improvements to
avx512_argsort
andavx512_argselect
intended to mitigate the effect of a vulnerability in gather instruction.
What's Changed
- Changes quicksort and quickselect to use template based sorting networks by @sterrettm2 in #61
- Correct documentation on NaN behavior by @zbjornson in #73
- update CI that builds NumPy by @r-devulap in #79
- Fix MSVC build error by @sterrettm2 in #76
- Use scalar emulation of gather instruction for arg methods by @r-devulap in #65
- Bug fix in avx512_qselect_fp16 by @r-devulap in #80
- Build shared library with runtime ISA dispatch by @r-devulap in #74
- CI: Do not install google benchmarks and use different compiler versions by @r-devulap in #82
- Add option to compare benchmarks across branches by @r-devulap in #84
- Use env var to disable/enable avx512 dispatch by @r-devulap in #85
- Various performance improvements by @sterrettm2 in #83
- Fix NumPy CI failures by @r-devulap in #86
- CI: numpy/core is renamed to numpy/_core by @r-devulap in #87
- Enable argsort and argselect methods on 32-bit platforms by @r-devulap in #88
- Disable prefetch on clang-cl WIN32 by @r-devulap in #89
- Explicitly initialize with zmm_vector<uint64_t> for arg methods on macOS by @r-devulap in #90
- Reorganize code and add examples to build by @r-devulap in #91
- Adds support for AVX2 for 32-bit types for quicksort and quickselect by @sterrettm2 in #60
- Add hasnan = false to all the sort methods by @r-devulap in #92
- Provide a way to install x86simdsort as a library by @r-devulap in #94
- AVX2 64-bit support by @sterrettm2 in #93
- Add soversion to the lib by @r-devulap in #95
New Contributors
- @sterrettm2 made their first contribution in #61
- @zbjornson made their first contribution in #73
Full Changelog: v3.0...4.0.rc
v3.0
Version 3.0 release contains a new supported method avx512_argselect
to compute arg nth_element
(also known as argpartition
in NumPy). It returns an array of indices that would partition the data array. Highlights of this release include:
- v3.0 x86-simd-sort is merged into NumPy main branch. It provides AVX-512 vectorized versions of
np.partition
andnp.argpartition
. It speeds upnp.partition
up by up to 25x for 16-bit, 17x for 32-bit dtypes and about 8x speed up for 64-bit dtypes. Speeds up fornp.argpartition
are up-to 6.5x. - A slightly modified version of x86-simd-sort has now been merged into OpenJDK . It speeds up sorting 32-bit and 64-bit data by up to 15x and 7x respectively.
What's Changed
- Fix typo in README by @r-devulap in #50
- Update workflow for changes in benchmark tooling by @mosullivan93 in #54
- Further Makefile updates by @mosullivan93 in #52
- Add avx512_argselect for 32-bit and 64-bit dtypes by @r-devulap in #56
- Use __builtin_cpu_supports instead of cpuinfo by @r-devulap in #58
- MAINT: Remove template specializations for quicksort methods by @r-devulap in #59
- Add benchmarks for small arrays by @r-devulap in #62
- Improvement to benchmarking scripts by @r-devulap in #66
- Bug fix in benchmark script by @r-devulap in #67
- Use global Macros for GCC specific keywords by @r-devulap in #68
- Fix compiler warnings in src and tests by @r-devulap in #69
- Fix more compiler warnings by @r-devulap in #70
Full Changelog: v2.0...v3.0
v2.0
Release v2.0 adds a host of new sorting algorithms and supported API's. We also welcome new members to the community who contributed heavily to this release and we are deeply grateful for their efforts. We’d like to acknowledge Mitchell J. O'Sullivan @mosullivan93 and @ruclz /Xi Tang @xtangxtang for adding partial sorting methods and key-value sorting methods.. Here are the main highlights of this release:
New features:
- AVX-512 based
argsort
algorithms using O(1) space for 32-bit and 64-bit data types [1]. It returns the indices that would sort an array. These are up to 6x faster when compared to a scalar solution that usesstd::sort
. This new feature is leveraged by NumPy innp.argsort
[2] in its latest release v1.25. - AVX-512 based quick select algorithm [3] for 16-bit, 32-bit and 64-bit data types. These are equivalent to
std::nthelement
but performs a lot faster. The performance depends on the ratio K/N (where K is the index of the element the array is partitioned around and N is the array size). For smaller values of K, it is up to 6x faster for 64-bit data, up to 15x faster for 32-bit data and up to 7x faster for 16-bit data. The performance gets better as K gets larger (see #13 (comment) for more details). - AVX-512 based partial sort algorithm [3] for 16-bit, 32-bit and 64-bit data types. These are equivalent to
std::partial_sort
. As with quick select, its performance depends on the ratio K/N (where K is the size of partial sorted array and N is the array size) and it tends to perform an order of magnitude faster for larger values. It is about 1.05x faster for tiny partial array sort and up to 20x faster for slightly larger partial arrays. See this #49 (comment) for detailed performance numbers. - AVX-512 based key-value sort [4]. This sorts a pair of key-value arrays and is currently supported only for 64-bit data types. A version of this has been contributed to Oceanbase, an open source distributed relational database. See [8].
- AVX-512 sort for _Float16 data type [5] using AVX-512 FP16 ISA. In NumPy, these are nearly 3x faster than AVX-512 based sort that emulates float16 data type [7] .
Improvements:
- Improvements to AVX-512 quicksort [6] further speeds up sorting 64 and 32-bit data types by up to 1.6x and 1.3x respectively.
New Contributors
- @ruclz made their first contribution in #2
- @bkmgit made their first contribution in #21
- @mosullivan93 made their first contribution in #13
Pull request references:
[1] Add AVX-512 argsort for 32 and 64-bit data type by @r-devulap in #34
[2] numpy/numpy#23707
[3] Implement partial sorting algorithms by @mosullivan93 in #13
[4] add key-value file! by @ruclz in #2
[5] Add qsort for _Float16 using AVX-512 FP16 ISA by @r-devulap in #16
[6] Improve qsort by @r-devulap in #33
[7] numpy/numpy#23435
[8] oceanbase/oceanbase#1325
Full Changelog: v1.0...v2.0
v1.0
First release of x86-simd-sort. This is the version that was merged into NumPy main branch. See numpy/numpy#22315 for more details. Supported features:
- AVX-512 quicksort for the following dtypes:
float16, uint16_t, int16_t, float, uint32_t, int32_t, double, uint64_t, int64_t.