
Adding new HWY_AVX10_2 target #2348

Open
johnplatts opened this issue Oct 9, 2024 · 2 comments

Comments

@johnplatts
Contributor

The upcoming Intel AVX10.2 instruction set (specification: https://www.intel.com/content/www/us/en/content-details/828965/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html) adds the following operations:

  • BF16 Add/Sub/Mul/Div/Sqrt/[Neg]MulAdd/[Neg]MulSub/ApproximateReciprocal[Sqrt]
  • BF16 Eq/Ne/Le/Lt/Ge/Gt/Min/Max
  • IEEE 754-2019 Min/Max for BF16/F16/F32/F64 vectors
  • BF16/F16/F32/F64 MinMagnitude (equivalent to IfThenElse(Lt(Abs(a), Abs(b)), a, b) if both a[i] and b[i] are non-NaN)
  • BF16/F16/F32/F64 MaxMagnitude (equivalent to IfThenElse(Lt(Abs(a), Abs(b)), b, a) if both a[i] and b[i] are non-NaN)
  • F16/BF16/F32->I8/U8 DemoteTo (there is already a use case for F16->I8/U8 DemoteTo in the implementation of I8/U8 Div on AVX3_SPR/AVX10_2/NEON_BF16)
  • F32->F16 OrderedDemote2To
  • New floating-point to integer PromoteTo/ConvertTo/DemoteTo instructions that saturate out-of-range non-NaN values to be within the range of the target integer type and convert NaNs to 0
  • F16->F32 WidenMulPairwiseAdd
  • U16xU16->U32 WidenMulPairwiseAdd/SatWidenMulPairwiseAccumulate/ReorderWidenMulAccumulate (originally introduced in AVX-VNNI-INT16, but extended to 512-bit vectors on AVX10.2 CPUs that support them)
  • I8xI8->I32 and U8xU8->I32 SumOfMulQuadAccumulate (originally introduced in AVX-VNNI-INT8, but extended to 512-bit vectors on AVX10.2 CPUs that support them)

GCC 15 and Clang 20, which are currently under development and scheduled for release in spring 2025, will support the new AVX10.2 intrinsics.

The new _mm*_cvttsp[h,s,d]_epi* intrinsics available on AVX10.2 should also fix the undefined behavior that currently occurs with GCC when converting out-of-range floating-point vectors to integer vectors (described in #2183).

We also need to move some of the ops for 256-bit or smaller vectors that are currently implemented in the hwy/ops/x86_512-inl.h header on AVX3 targets into a separate header, since support for 512-bit vectors is optional on AVX10.2.

@jan-wassenberg
Member

Thanks for starting the discussion! Looks like GNR has also just been introduced/launched, but that supports 10.1, I think.

Min/MaxNumber (Min with proper NaN handling per IEEE754:2019) and Min/MaxMagnitude look useful, as does F16 WidenMulPairwiseAdd. Would be very happy to see those added :)
I don't see a burning need for bf16 ops. AVX10.2 is AFAIK the only platform that has them, and just about the only demand I see for bf16 is mul/add, which is mostly covered by the existing WidenMul.

I agree we'd want to split the "AVX3" and "512-bit" aspects of x86_512-inl.h.

How about I make a TODO for around 2025-03 to lay the groundwork by creating the HWY_AVX10_2 (or HWY_AVX102?) target/boilerplate? Would you later like to add some of its functionality?

@johnplatts
Contributor Author

MinMagnitude/MaxMagnitude ops are implemented in pull request #2353.
