Add AArch64 SVE implementation for TCoeffOps fastFwdCore_2D #474

georges-arm · 2024-11-25T17:08:38Z

This adds the CMake framework code needed to build SVE code, followed by an initial SVE kernel that makes use of this infrastructure. It does not attempt to set SVE as a default-enabled target for now.

Wire up CMake to enable compiling files with AArch64 SVE/SVE2

This commit enables compilation for new sve/ and sve2/ subdirectories under CommonLib/arm/, assuming that the appropriate -march=... flags are available.

Introduce a new set_if_compiler_supports_arm_extensions function to set up variables based on whether the compiler supports SVE and SVE2, then use these along with the existing requirements (SVE2 requires SVE to be enabled, SVE requires Neon to be enabled) to determine what features ultimately end up being enabled.

The implementation of the extension checking helper function needs to check (a) whether the flag is supported, and (b) whether code will successfully compile when using that flag. The latter is needed since there are some old versions of LLVM that are missing the arm_neon_sve_bridge.h header, and LLVM currently fails to compile SVE code when targeting Windows.

Add AArch64 SVE implementation for `TCoeffOps` `fastFwdCore_2D`

The SVE 16-bit dot-product instructions allow us to accumulate twice as much data per instruction compared to Neon multiply-add instructions, giving a good speedup for the fastFwdCore_2D kernels.
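To make the "twice as much data per instruction" point concrete, here is a scalar model of the two accumulation styles. It is only a sketch under stated assumptions: it assumes the Neon baseline is a widening multiply-accumulate that consumes four 16-bit pairs per instruction, and that the SVE dot product sums each group of four adjacent 16-bit products into one 64-bit lane. The function names `modelNeonMlal` and `modelSveDot` are made up for illustration and do not appear in the PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of one Neon widening multiply-accumulate:
// four 16-bit products are added into four separate 32-bit accumulators.
inline void modelNeonMlal(std::array<std::int32_t, 4>& acc,
                          const std::array<std::int16_t, 4>& a,
                          const std::array<std::int16_t, 4>& b)
{
  for (std::size_t i = 0; i < 4; ++i)
  {
    acc[i] += std::int32_t(a[i]) * std::int32_t(b[i]);
  }
}

// Hypothetical scalar model of one SVE 16-bit dot product. Each 64-bit
// accumulator lane receives the sum of four adjacent 16-bit products, so a
// 128-bit vector (two accumulator lanes) consumes eight input pairs.
inline void modelSveDot(std::vector<std::int64_t>& acc,
                        const std::vector<std::int16_t>& a,
                        const std::vector<std::int16_t>& b)
{
  assert(a.size() == b.size() && a.size() == 4 * acc.size());
  for (std::size_t lane = 0; lane < acc.size(); ++lane)
  {
    for (std::size_t k = 0; k < 4; ++k)
    {
      acc[lane] += std::int64_t(a[4 * lane + k]) * std::int64_t(b[4 * lane + k]);
    }
  }
}
```

In this model, one call to `modelSveDot` with two accumulator lanes (VL=128) handles eight input pairs, while one call to `modelNeonMlal` handles four.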

Compared to Neon with a fixed vector length of 128 bits, SVE allows different micro-architectures to expose a number of different vector lengths: 128, 256, 512, 1024, or 2048 bits. To take advantage of this we can rewrite the innermost loop of fastFwdCore_2D to be expressed in terms of the number of vectors to process rather than the number of elements, and then pick the number of iterations at setup-time by inspecting the vector length. This allows us to largely avoid needing an entire set of kernels for each possible vector length.
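As a rough illustration of picking the iteration count at setup time, the sketch below derives the number of full-vector inner-loop iterations from the vector length. The helper names `lanesPerVector` and `vectorIterations` are hypothetical and not taken from the PR; on real hardware the vector length in bits could be obtained as `svcntb() * 8` from `<arm_sve.h>`, whereas here it is passed in so the example builds anywhere.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of 16-bit elements held by one vector of
// the given length in bits.
constexpr std::size_t lanesPerVector(std::size_t vlBits)
{
  return vlBits / 16;
}

// Hypothetical helper: number of full-vector iterations needed to cover
// `numElems` 16-bit elements. Only whole vectors are handled, matching the
// restriction described in this commit message.
inline std::size_t vectorIterations(std::size_t numElems, std::size_t vlBits)
{
  // Vector lengths listed in the commit message: 128, 256, 512, 1024, 2048.
  assert(vlBits >= 128 && vlBits <= 2048 && (vlBits & (vlBits - 1)) == 0);
  const std::size_t lanes = lanesPerVector(vlBits);
  assert(numElems % lanes == 0); // no partial vectors in this sketch
  return numElems / lanes;
}
```

With 32 elements per row, this gives four iterations at VL=128 and two at VL=256, so the same loop body serves both vector lengths and only the trip count changes.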

One caveat to the notion of completely vector-length-agnostic kernels is that when the vector length is known to be exactly 128 bits (the same as Neon), we can make use of some Neon instructions to speed up processing the data after the accumulation. This is possible since Neon and SVE registers share the low 128 bits of each vector register.

For this commit we have not attempted to add kernels that process less than a full vector's worth of data per inner loop iteration, which would enable using these kernels on machines with very wide vectors (512, 1024, or 2048 bits). This is technically straightforward since SVE supports partial vectors via predication; however, there are no known long-vector micro-architectures available at present to justify maintaining such code.
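For readers unfamiliar with predication, the following scalar model sketches how a partial vector could be handled. It is not part of this PR, which deliberately omits such kernels. `modelWhileLt` imitates a "while less-than" predicate (lane `i` is active only while `base + i < count`), and `modelPredicatedSum` skips inactive lanes; both names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of a "while less-than" predicate: lane i is
// active only while base + i < count.
inline std::vector<bool> modelWhileLt(std::size_t base, std::size_t count,
                                      std::size_t lanes)
{
  std::vector<bool> pred(lanes);
  for (std::size_t i = 0; i < lanes; ++i)
  {
    pred[i] = (base + i) < count;
  }
  return pred;
}

// Hypothetical model of a predicated reduction: the data is walked in steps
// of `lanes` elements and inactive lanes contribute nothing, so the input
// may be shorter than one vector.
inline std::int64_t modelPredicatedSum(const std::vector<std::int16_t>& src,
                                       std::size_t lanes)
{
  assert(lanes > 0);
  std::int64_t sum = 0;
  for (std::size_t base = 0; base < src.size(); base += lanes)
  {
    const std::vector<bool> pred = modelWhileLt(base, src.size(), lanes);
    for (std::size_t i = 0; i < lanes; ++i)
    {
      if (pred[i])
      {
        sum += src[base + i];
      }
    }
  }
  return sum;
}
```

Here an 8-element row processed with 32 lanes (a 512-bit vector of 16-bit elements) needs a single iteration in which only the first eight lanes are active.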

Running a video encoding job on SVE-capable machines using the --preset=fast setting shows the following improvements in reported FPS:

Neoverse V1 (VL=256 bits): ~1.3%
Neoverse V2 (VL=128 bits): ~2.6%

cmake/modules/vvencCompilerSupport.cmake

source/Lib/vvenc/CMakeLists.txt

K-os

Thanks, looks good now.

georges-arm mentioned this pull request Nov 25, 2024

Add AArch64 Neon implementation of motionErrorLumaInt8 #465

Merged

adamjw24 requested a review from K-os November 26, 2024 08:57

K-os suggested changes Nov 26, 2024

cmake/modules/vvencCompilerSupport.cmake (outdated, resolved)

source/Lib/vvenc/CMakeLists.txt (outdated, resolved)

georges-arm added 2 commits November 28, 2024 13:15

georges-arm force-pushed the geoste01/fastFwdCore_2D-sve branch from c6a6843 to 4c8bfef Compare November 28, 2024 13:19

K-os approved these changes Nov 28, 2024

adamjw24 merged commit dcd4758 into fraunhoferhhi:master Nov 29, 2024
8 checks passed

georges-arm deleted the geoste01/fastFwdCore_2D-sve branch November 29, 2024 16:23
