Add AArch64 SVE implementation for TCoeffOps fastFwdCore_2D #474

Merged

@georges-arm georges-arm commented Nov 25, 2024

This PR adds a bit more CMake framework code to support building SVE code, followed by an initial piece of SVE code that uses this infrastructure. It does not attempt to make SVE a default-enabled target for now.

Wire up CMake to enable compiling files with AArch64 SVE/SVE2

This commit enables compilation for new sve/ and sve2/ subdirectories under CommonLib/arm/, assuming that the appropriate -march=... flags are available.

Introduce a new set_if_compiler_supports_arm_extensions function to set up variables based on whether the compiler supports SVE and SVE2, then use these along with the existing requirements (SVE2 requires SVE to be enabled, SVE requires Neon to be enabled) to determine what features ultimately end up being enabled.
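The dependency chain described above can be modelled as follows. This is only an illustrative C++ sketch of the decision logic, not the CMake code from this PR; the struct and function names (`ArmFeatureInputs`, `resolveArmFeatures`) are hypothetical.

```cpp
// Hypothetical inputs: what was requested and what the compiler check reported.
struct ArmFeatureInputs {
  bool neonEnabled;     // Neon is enabled in the build
  bool sveRequested;    // user asked for SVE
  bool sveCompilerOk;   // compiler accepted the SVE flag and test program
  bool sve2Requested;   // user asked for SVE2
  bool sve2CompilerOk;  // compiler accepted the SVE2 flag and test program
};

struct ArmFeatureResult {
  bool sve;
  bool sve2;
};

// SVE requires Neon; SVE2 requires SVE. Each level also needs compiler support.
inline ArmFeatureResult resolveArmFeatures(const ArmFeatureInputs& in) {
  ArmFeatureResult out{};
  out.sve  = in.neonEnabled && in.sveRequested && in.sveCompilerOk;
  out.sve2 = out.sve && in.sve2Requested && in.sve2CompilerOk;
  return out;
}
```

With this shape, requesting SVE2 on a compiler that fails the SVE check yields neither feature, which matches the requirement ordering stated above.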

The implementation of the extension checking helper function needs to check (a) whether the flag is supported, and (b) whether code will successfully compile when using that flag. The latter is needed since there are some old versions of LLVM that are missing the arm_neon_sve_bridge.h header, and LLVM currently fails to compile SVE code when targeting Windows.

Add AArch64 SVE implementation for TCoeffOps fastFwdCore_2D

The SVE 16-bit dot-product instructions allow us to accumulate twice as much data per instruction compared to Neon multiply-add instructions, giving a good speedup for the fastFwdCore_2D kernels.
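To make the "twice as much data" claim concrete, here is a scalar model of the two operations. It assumes the Neon instruction being compared against is a widening multiply-accumulate on four 16-bit lanes (SMLAL), and it models the SVE dot product (SDOT with 16-bit sources and 64-bit accumulators) at a 128-bit vector length. The function names are hypothetical and this is not the kernel code from the PR.

```cpp
#include <array>
#include <cstdint>

// Scalar model of a Neon widening multiply-accumulate (SMLAL):
// four 16-bit products, each added to its own 32-bit accumulator lane.
inline std::array<int32_t, 4> modelNeonMlal(std::array<int32_t, 4> acc,
                                            const std::array<int16_t, 4>& a,
                                            const std::array<int16_t, 4>& b) {
  for (int i = 0; i < 4; ++i) {
    acc[i] += int32_t(a[i]) * int32_t(b[i]);
  }
  return acc;
}

// Scalar model of the SVE 16-bit dot product at VL = 128 bits:
// eight 16-bit products, summed in groups of four into two 64-bit lanes.
inline std::array<int64_t, 2> modelSveDot(std::array<int64_t, 2> acc,
                                          const std::array<int16_t, 8>& a,
                                          const std::array<int16_t, 8>& b) {
  for (int lane = 0; lane < 2; ++lane) {
    for (int k = 0; k < 4; ++k) {
      acc[lane] += int64_t(a[4 * lane + k]) * int64_t(b[4 * lane + k]);
    }
  }
  return acc;
}
```

Under these assumptions, one modelled dot product consumes eight input elements per operand where the modelled multiply-accumulate consumes four, and it also performs part of the horizontal reduction.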

Unlike Neon, which has a fixed vector length of 128 bits, SVE allows different micro-architectures to expose different vector lengths: 128, 256, 512, 1024, or 2048 bits. To take advantage of this we can rewrite the innermost loop of fastFwdCore_2D to be expressed in terms of the number of vectors to process rather than the number of elements, and then pick the number of iterations at setup time by inspecting the vector length. This allows us to largely avoid needing an entire set of kernels for each possible vector length.
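A rough sketch of that idea, as a portable scalar model rather than the actual kernel: the loop bound is a count of vectors chosen once from the vector length. The helper names (`elemsPerVector16`, `vectorsPerRow`, `dotRow`) are hypothetical. On real SVE hardware the element count would presumably come from the ACLE `svcnth()` intrinsic instead of a `vlBits` parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of 16-bit elements in one vector of the given length in bits.
inline std::size_t elemsPerVector16(std::size_t vlBits) {
  assert(vlBits >= 128 && vlBits <= 2048 && vlBits % 128 == 0);
  return vlBits / 16;
}

// Setup-time choice: how many whole vectors cover a row of rowElems values.
// Returns 0 when the row is not a whole number of vectors, i.e. the case
// that would need partial-vector handling.
inline std::size_t vectorsPerRow(std::size_t rowElems, std::size_t vlBits) {
  const std::size_t n = elemsPerVector16(vlBits);
  return (rowElems % n == 0) ? rowElems / n : 0;
}

// Inner loop expressed in vectors: the outer loop runs numVectors times and
// the inner loop stands in for one full-vector multiply-accumulate.
inline int64_t dotRow(const int16_t* src, const int16_t* coef,
                      std::size_t numVectors, std::size_t vlBits) {
  const std::size_t n = elemsPerVector16(vlBits);
  int64_t sum = 0;
  for (std::size_t v = 0; v < numVectors; ++v) {
    for (std::size_t i = 0; i < n; ++i) {
      sum += int64_t(src[v * n + i]) * int64_t(coef[v * n + i]);
    }
  }
  return sum;
}
```

For a 32-element row this gives four iterations at 128 bits and two at 256 bits, with the same loop body in both cases.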

One caveat to the notion of having completely vector-length-agnostic kernels is that when the vector length is known to be exactly 128 bits (the same as Neon) we can make use of some Neon instructions to speed up processing the data after the accumulation. This is possible since Neon and SVE registers share the low 128 bits of each vector register.

For this commit we have not attempted to add kernels that process less than a full vector's worth of data per inner loop iteration, which would enable using these kernels on machines with very wide vectors (512, 1024, or 2048 bits). This is technically straightforward since SVE supports partial vectors via predication; however, there are no known long-vector micro-architectures available at present to justify maintaining such code.
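For illustration only, partial-vector handling via predication can be modelled in scalar C++ as below. This is not part of the PR; `activeLanes` and `predicatedDot` are hypothetical names, and `activeLanes` stands in for the lane mask that an SVE "while less-than" predicate would produce.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Number of active 16-bit lanes for the vector starting at element `start`
// when only `total` elements exist in the row.
inline std::size_t activeLanes(std::size_t start, std::size_t total,
                               std::size_t lanesPerVector) {
  return start >= total ? 0 : std::min(lanesPerVector, total - start);
}

// Dot product where the last (or only) vector may be partially filled.
// Inactive lanes contribute nothing to the sum.
inline int64_t predicatedDot(const int16_t* a, const int16_t* b,
                             std::size_t total, std::size_t lanesPerVector) {
  int64_t sum = 0;
  for (std::size_t start = 0; start < total; start += lanesPerVector) {
    const std::size_t active = activeLanes(start, total, lanesPerVector);
    for (std::size_t i = 0; i < active; ++i) {
      sum += int64_t(a[start + i]) * int64_t(b[start + i]);
    }
  }
  return sum;
}
```

In this model an 8-element row on a 2048-bit machine (128 lanes of 16 bits) is handled in a single iteration with 8 active lanes.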

Running a video encoding job on SVE-capable machines using the --preset=fast setting shows the following improvements in reported FPS:

Neoverse V1 (VL=256 bits): ~1.3%
Neoverse V2 (VL=128 bits): ~2.6%

@georges-arm georges-arm force-pushed the geoste01/fastFwdCore_2D-sve branch from c6a6843 to 4c8bfef Compare November 28, 2024 13:19

@K-os K-os left a comment

Thanks, looks good now.

@adamjw24 adamjw24 merged commit dcd4758 into fraunhoferhhi:master Nov 29, 2024
8 checks passed
@georges-arm georges-arm deleted the geoste01/fastFwdCore_2D-sve branch November 29, 2024 16:23