
Adding new HWY_AVX10_2 target #2348

Open
johnplatts opened this issue Oct 9, 2024 · 2 comments

Comments

@johnplatts
Contributor

The upcoming Intel AVX10.2 instruction set (specification: https://www.intel.com/content/www/us/en/content-details/828965/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html) adds the following operations:

  • BF16 Add/Sub/Mul/Div/Sqrt/[Neg]MulAdd/[Neg]MulSub/ApproximateReciprocal[Sqrt]
  • BF16 Eq/Ne/Le/Lt/Ge/Gt/Min/Max
  • IEEE 754-2019 Min/Max for BF16/F16/F32/F64 vectors
  • BF16/F16/F32/F64 MinMagnitude (equivalent to IfThenElse(Lt(Abs(a), Abs(b)), a, b) if both a[i] and b[i] are non-NaN)
  • BF16/F16/F32/F64 MaxMagnitude (equivalent to IfThenElse(Lt(Abs(a), Abs(b)), b, a) if both a[i] and b[i] are non-NaN)
  • F16/BF16/F32->I8/U8 DemoteTo (there is already a use case for F16->I8/U8 DemoteTo in the implementation of I8/U8 Div on AVX3_SPR/AVX10_2/NEON_BF16)
  • F32->F16 OrderedDemote2To
  • New floating-point to integer PromoteTo/ConvertTo/DemoteTo instructions that saturate out-of-range non-NaN values to be within the range of the target integer type and convert NaNs to 0
  • F16->F32 WidenMulPairwiseAdd
  • U16xU16->U32 WidenMulPairwiseAdd/SatWidenMulPairwiseAccumulate/ReorderWidenMulAccumulate (originally introduced in AVX-VNNI-INT16, but extended to 512-bit vectors on AVX10.2 CPUs that support them)
  • I8xI8->I32 and U8xU8->I32 SumOfMulQuadAccumulate (originally introduced in AVX-VNNI-INT8, but extended to 512-bit vectors on AVX10.2 CPUs that support them)

GCC 15 and Clang 20, which are currently under development and scheduled for release in spring 2025, will support the new AVX10.2 intrinsics.

The new _mm*_cvttsp[h,s,d]_epi* intrinsics available on AVX10.2 should also fix the undefined behavior that currently occurs with GCC when converting out-of-range floating-point vectors to integer vectors (described in #2183).

We also need to move some of the ops for 256-bit or smaller vectors that are currently implemented in the hwy/ops/x86_512-inl.h header on AVX3 targets into a separate header, since support for 512-bit vectors is optional on AVX10.2.

@jan-wassenberg
Member

Thanks for starting the discussion! Looks like GNR has also just been introduced/launched, but that supports 10.1, I think.

Min/MaxNumber (Min with proper NaN handling per IEEE754:2019) and Min/MaxMagnitude look useful, as does F16 WidenMulPairwiseAdd. Would be very happy to see those added :)
I don't see a burning need for bf16 ops. AVX10.2 is AFAIK the only platform that has them, and just about the only demand I see for bf16 is mul/add, which is mostly covered by the existing WidenMul.

I agree we'd want to split the "AVX3" and "512-bit" aspects of x86_512-inl.h.

How about I make a TODO for around 2025-03 to lay the groundwork by creating the HWY_AVX10_2 (or HWY_AVX102?) target/boilerplate? Would you later like to add some of its functionality?

@johnplatts
Contributor Author

MinMagnitude/MaxMagnitude ops are implemented in pull request #2353.
