Load/Store for SIMD type wrappers #4288

reneme · 2024-08-05T13:59:38Z

This is an attempt at adding support for custom types to load_le/load_be and store_le/store_be. Any custom type can hook into these by implementing the adapter methods static T::load_{be/le} -> T and T::store_{be/le} -> void. This is the same concept we established with _const_time_poison().

By implementing this for SIMD_4x32 and combining it with the proposed BufferTransformer and a little bit of glue, I ended up with this for the AES-NI-128 decrypt function (it provides the same functionality as the original implementation):

BOTAN_FUNC_ISA("ssse3,aes") void AES_128::hw_aes_decrypt_n(const uint8_t in[], uint8_t out[], size_t blocks) const {
   constexpr size_t rounds = 10;
   const auto K = load_le<std::array<SIMD_4x32, rounds + 1>>(Botan::as_bytes(std::span{m_DK}));

   constexpr size_t four = 4 * BLOCK_SIZE;
   constexpr size_t one = 1 * BLOCK_SIZE;

   BufferTransformer(std::span{in, blocks * BLOCK_SIZE}, std::span{out, blocks * BLOCK_SIZE})
      .process_blocks_of<four, one>([&](auto i, auto o) {
         constexpr size_t blocks = i.size() / BLOCK_SIZE;
         auto Bs = load_le<std::array<SIMD_4x32, blocks>>(i);

         keyxor_new(K[0], Bs);
         for(size_t round = 1; round != rounds; ++round) {
            aesdec_new(K[round], Bs);
         }
         aesdeclast_new(K[rounds], Bs);

         store_le(o, Bs);
      });
}

By pulling the rounds variable into a template parameter, we might even be able to share this implementation with the other AES key lengths.

Mini-Benchmark

The generated assembly is exactly the same (for clang 18). GCC 13 is doing an equally good job. No perceivable slowdown on either compiler. MSVC doesn't seem to care to unroll the statically sized for-loops and in particular seems to have trouble seeing through the load/store magic, resulting in an "amazing" 8x slowdown. 😒 This is really sad!

coveralls · 2024-08-05T16:21:05Z

coverage: 91.279% (+0.005%) from 91.274%
when pulling 5c84388 on Rohde-Schwarz:rene/simd_load_store
into 1499274 on randombit:master.

randombit · 2024-08-05T23:45:39Z

> GCC 13 is doing an equally good job.

In my experience GCC only unrolls loops with a constexpr or equivalent loop count at -O3; at -O2 or lower it fails to unroll. That's fine, because we use -O3 ... except that most Linux distros compile us with their "approved" set of compiler flags, which uses -O2 😞

reneme · 2024-08-06T06:02:50Z

> with -O3 - with -O2 or lower it fails to unroll.

We could force the unrolling by using an index sequence and fold expressions. We're using this trick already for the bswap fallback implementation and IIRC compilers even recognized this as an actual bswap.

Wrapping that into an "okay-to-use" template doesn't even lead to unreadable code, and it unrolls regardless of the optimization settings (godbolt.org).

#include <concepts>
#include <cstddef>
#include <iostream>
#include <utility>

template <size_t begin, size_t end, std::invocable<size_t> FnT>
    requires(end >= begin)
constexpr void unrolled_for(FnT&& fn) {
    [&]<size_t... indices>(std::index_sequence<indices...>) {
        (fn(indices + begin), ...);
    }(std::make_index_sequence<end - begin>());
}

int main() {
    // The fold expands to eight separate calls: fn(2), fn(3), ..., fn(9).
    unrolled_for<2, 10>([](size_t i) { std::cout << i << '\n'; });
}

Frankly, I'm much more concerned about MSVC here, mostly because the mentioned slowdown seemed to come from the poor optimization of the load/store abstractions. I toyed with its flags a bit but had no luck. 😞

src/tests/test_utils.cpp

randombit · 2024-08-10T12:42:39Z

CI failures are relevant. Easiest fix is probably to move the test to test_simd.cpp

reneme · 2024-08-12T12:24:07Z

I tried adding the same load/store helpers to SIMD_8x32 and SIMD_16x32, but this isn't as straightforward due to the inline ISA annotations. Namely, the load_any wrappers aren't marked with those ISA flags and therefore cannot inline the implementations. I'll leave this as future work.

reneme force-pushed the rene/simd_load_store branch from a415398 to 9dd2623 Compare August 5, 2024 15:42

reneme mentioned this pull request Aug 5, 2024

Add support for AVX2-VAES #4287

Merged

reneme force-pushed the rene/simd_load_store branch from 9dd2623 to e9cfa58 Compare August 5, 2024 15:49

randombit approved these changes Aug 10, 2024


Allow free-standing load/store functions on SIMD_4x32

5c84388

reneme force-pushed the rene/simd_load_store branch from e9cfa58 to 5c84388 Compare August 12, 2024 12:20

reneme marked this pull request as ready for review August 12, 2024 12:24

reneme merged commit 5f4c2c3 into randombit:master Aug 13, 2024
40 checks passed

reneme deleted the rene/simd_load_store branch August 13, 2024 09:14
