Add AArch64 SVE implementation for TCoeffOps fastFwdCore_2D #474
A bit more CMake framework code to support building SVE code, followed by an initial bit of SVE code that makes use of this infrastructure. This commit does not attempt to set SVE as a default-enabled target for now.
Wire up CMake to enable compiling files with AArch64 SVE/SVE2

This commit enables compilation for new `sve/` and `sve2/` subdirectories under `CommonLib/arm/`, assuming that the appropriate `-march=...` flags are available.

Introduce a new `set_if_compiler_supports_arm_extensions` function to set up variables based on whether the compiler supports SVE and SVE2, then use these along with the existing requirements (SVE2 requires SVE to be enabled, SVE requires Neon to be enabled) to determine what features ultimately end up being enabled.

The implementation of the extension checking helper function needs to check (a) whether the flag is supported, and (b) whether code will successfully compile when using that flag. The latter is needed since there are some old versions of LLVM that are missing the `arm_neon_sve_bridge.h` header, and LLVM currently fails to compile SVE code when targeting Windows.

Add AArch64 SVE implementation for `TCoeffOps` `fastFwdCore_2D`

The SVE 16-bit dot-product instructions allow us to accumulate twice as much data per instruction compared to Neon multiply-add instructions, giving a good speedup for the `fastFwdCore_2D` kernels.

Compared to Neon with a fixed vector length of 128 bits, SVE allows different micro-architectures to expose a number of different vector lengths: 128, 256, 512, 1024, or 2048 bits. To take advantage of this we can rewrite the innermost loop of `fastFwdCore_2D` to be expressed in terms of the number of vectors to process rather than the number of elements, and then pick the number of iterations at setup time by inspecting the vector length. This allows us to largely avoid needing an entire set of kernels for each possible vector length.

One caveat to the notion of having completely vector-length-agnostic kernels is that when the vector length is known to be exactly 128 bits (the same as Neon) we can make use of some Neon instructions to speed up processing the data after the accumulation. This is possible since Neon and SVE registers share the low 128 bits of each vector register.

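As a rough illustration of an inner loop counted in vectors rather than elements, here is a portable scalar model, not the code from this PR. It assumes that a 16-bit dot product sums four adjacent 16-bit products into each 64-bit accumulator lane; the names `modelDot16` and `modelRowDot` are hypothetical and exist only for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption for this sketch: a 16-bit dot product sums four adjacent
// int16 x int16 products into each 64-bit accumulator lane.
constexpr std::size_t kElemsPerLane = 4;

// Hypothetical helper: scalar model of one dot-product operation on a
// vector of `vecBits` bits. `acc` has vecBits / 64 lanes, `a` and `b`
// each have vecBits / 16 elements.
void modelDot16(std::int64_t* acc, const std::int16_t* a,
                const std::int16_t* b, std::size_t vecBits)
{
  const std::size_t lanes = vecBits / 64;
  for (std::size_t lane = 0; lane < lanes; ++lane)
  {
    for (std::size_t k = 0; k < kElemsPerLane; ++k)
    {
      const std::size_t i = lane * kElemsPerLane + k;
      acc[lane] += static_cast<std::int64_t>(a[i]) * b[i];
    }
  }
}

// Hypothetical kernel: dot product of one source row with one coefficient
// row. The inner loop runs `numVecs` times, so the same loop body serves
// any vector length; only the iteration count changes.
std::int64_t modelRowDot(const std::int16_t* src, const std::int16_t* coef,
                         std::size_t numVecs, std::size_t vecBits)
{
  const std::size_t elemsPerVec = vecBits / 16;
  std::vector<std::int64_t> acc(vecBits / 64, 0);
  for (std::size_t v = 0; v < numVecs; ++v)
  {
    modelDot16(acc.data(), src + v * elemsPerVec, coef + v * elemsPerVec, vecBits);
  }
  // Horizontal reduction of the accumulator lanes after the loop.
  std::int64_t sum = 0;
  for (std::int64_t lane : acc)
  {
    sum += lane;
  }
  return sum;
}
```

In this model a 32-element row takes four iterations at 128 bits and two at 256 bits, and both produce the same sum. The final horizontal reduction stands in for the post-accumulation step, which is where the real kernels can use Neon instructions when the vector length is exactly 128 bits.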
For this commit we have not attempted to add kernels that process less than a full vector's worth of data per inner loop iteration, which would enable using these kernels on machines with very wide vectors (512, 1024, or 2048 bits). This is technically straightforward since SVE supports partial vectors via predication; however, there are no known long-vector micro-architectures available at present to justify maintaining such code.
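The setup-time choice of iteration count, and the full-vector restriction, could be sketched as follows. This is a minimal sketch under stated assumptions: `vectorsPerRow` is a hypothetical helper, and returning 0 to signal "use another implementation" is an assumption of this example rather than the behaviour of the PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical setup-time helper.
// rowElems: number of 16-bit elements processed per row of the kernel.
// vecBytes: vector length in bytes (in real SVE code this could be
//           queried once at setup time, e.g. with the ACLE svcntb()).
// Returns the number of full vectors per row, or 0 if the row cannot be
// covered by whole vectors, in which case the caller would need a
// different implementation (an assumption of this sketch).
std::size_t vectorsPerRow(std::size_t rowElems, std::size_t vecBytes)
{
  const std::size_t elemsPerVec = vecBytes / sizeof(std::int16_t);
  if (elemsPerVec == 0 || rowElems % elemsPerVec != 0)
  {
    return 0;
  }
  return rowElems / elemsPerVec;
}
```

For example, a 32-element row gives four iterations with 16-byte (128-bit) vectors and two with 32-byte (256-bit) vectors, while a 4-element row with 64-byte (512-bit) vectors yields 0 because it does not fill a single vector.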
Running a video encoding job on SVE-capable machines using the `--preset=fast` setting shows the following improvements in reported FPS: