Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load/Store for SIMD type wrappers #4288

Merged
merged 1 commit into from
Aug 13, 2024

Conversation

reneme
Copy link
Collaborator

@reneme reneme commented Aug 5, 2024

This is an attempt for adding support to load_le and/or store_be for custom types. Essentially, any custom type can implement adapter methods static T::load_{be/le} -> T and T::store_{be/le} -> void to hook into this. Essentially the same concept as we established with _const_time_poison().

By implementing this for SIMD_4x32, combining this with the proposed BufferTransformer and a little bit of glue, I ended up with this for AES-NI-128 decrypt function (that provides the same functionality as the original implementation):

BOTAN_FUNC_ISA("ssse3,aes") void AES_128::hw_aes_decrypt_n(const uint8_t in[], uint8_t out[], size_t blocks) const {
   constexpr size_t rounds = 10;
   const auto K = load_le<std::array<SIMD_4x32, rounds + 1>>(Botan::as_bytes(std::span{m_DK}));

   constexpr size_t four = 4 * BLOCK_SIZE;
   constexpr size_t one = 1 * BLOCK_SIZE;

   BufferTransformer(std::span{in, blocks * BLOCK_SIZE}, std::span{out, blocks * BLOCK_SIZE})
      .process_blocks_of<four, one>([&](auto i, auto o) {
         constexpr size_t blocks = i.size() / BLOCK_SIZE;
         auto Bs = load_le<std::array<SIMD_4x32, blocks>>(i);

         keyxor_new(K[0], Bs);
         for(size_t round = 1; round != rounds; ++round) {
            aesdec_new(K[round], Bs);
         }
         aesdeclast_new(K[10], Bs);

         store_le(o, Bs);
      });
}

When pulling the rounds variable into a template parameter we might even be able to share this implementation for the other AES keylengths.

Mini-Benchmark

The generated assembly is exactly the same (for clang 18). GCC 13 is doing an equally good job. No perceivable slowdown on either compiler. MSVC doesn't seem to care to unroll the statically sized for-loops and perhaps other things and in particular seems to have trouble seeing through the load/store magic, resulting in an "amazing" 8x slowdown. 😒 This is really sad!

@reneme reneme force-pushed the rene/simd_load_store branch from a415398 to 9dd2623 Compare August 5, 2024 15:42
@reneme reneme mentioned this pull request Aug 5, 2024
@reneme reneme force-pushed the rene/simd_load_store branch from 9dd2623 to e9cfa58 Compare August 5, 2024 15:49
@coveralls
Copy link

coveralls commented Aug 5, 2024

Coverage Status

coverage: 91.279% (+0.005%) from 91.274%
when pulling 5c84388 on Rohde-Schwarz:rene/simd_load_store
into 1499274 on randombit:master.

@randombit
Copy link
Owner

GCC 13 is doing an equally good job.

In my experience GCC only unrolls loops with a constexpr or equivalent loop count with -O3 - with -O2 or lower it fails to unroll. That's fine, cause we use -O3 ... except that most Linux distros compile us with their "approved" set of compiler flags which uses -O2 😞

@reneme
Copy link
Collaborator Author

reneme commented Aug 6, 2024

with -O3 - with -O2 or lower it fails to unroll.

We could force the unrolling by using an index sequence and fold expressions. We're using this trick already for the bswap fallback implementation and IIRC compilers even recognized this as an actual bswap.

Wrapping that into an "okay-to-use" template, doesn't even lead to unreadable code, and unrolls regardless of the optimization settings (godbolt.org).

template <size_t begin, size_t end, std::invocable<size_t> FnT>
    requires(end >= begin)
constexpr void unrolled_for(FnT&& fn) {
    [&]<size_t... indices>(std::index_sequence<indices...>) {
        (fn(indices + begin), ...);
    }(std::make_index_sequence<end - begin>());
}

int main() {
  unrolled_for<2, 10>([](size_t i) {std::cout << i << '\n'; });
}

Frankly, I'm much more concerned about MSVC here. And mostly because the mentioned slowdown seemed to come from the poor optimization of the load/store abstractions. I toyed with its flags a bit but no luck. 😞

src/tests/test_utils.cpp Outdated Show resolved Hide resolved
@randombit
Copy link
Owner

CI failures are relevant. Easiest fix is probably to move the test to test_simd.cpp

@reneme reneme force-pushed the rene/simd_load_store branch from e9cfa58 to 5c84388 Compare August 12, 2024 12:20
@reneme
Copy link
Collaborator Author

reneme commented Aug 12, 2024

I tried adding the same load/store helpers to SIMD_8x32 and SIMD_16x32. Though, this isn't as straight-forward due to the inline ISA annotations. Namely, the load_any wrappers aren't marked with those ISA flags and therefore cannot inline the implementations. I'll leave this as future work here.

@reneme reneme marked this pull request as ready for review August 12, 2024 12:24
@reneme reneme merged commit 5f4c2c3 into randombit:master Aug 13, 2024
40 checks passed
@reneme reneme deleted the rene/simd_load_store branch August 13, 2024 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants