Load/Store for SIMD type wrappers #4288
In my experience GCC only unrolls loops with a |
We could force the unrolling by using an index sequence and fold expressions. We're already using this trick for the bswap fallback implementation and, IIRC, compilers even recognized this as an actual bswap. Wrapping that into an "okay-to-use" template doesn't even lead to unreadable code, and it unrolls regardless of the optimization settings (godbolt.org).

    #include <concepts>
    #include <cstddef>
    #include <iostream>
    #include <utility>

    // Calls fn(begin), fn(begin + 1), ..., fn(end - 1) via a fold expression,
    // so there is no runtime loop left for the compiler to (not) unroll.
    template <size_t begin, size_t end, std::invocable<size_t> FnT>
       requires(end >= begin)
    constexpr void unrolled_for(FnT&& fn) {
       [&]<size_t... indices>(std::index_sequence<indices...>) {
          (fn(indices + begin), ...);
       }(std::make_index_sequence<end - begin>());
    }

    int main() {
       unrolled_for<2, 10>([](size_t i) { std::cout << i << '\n'; });
    }

Frankly, I'm much more concerned about MSVC here, mostly because the mentioned slowdown seemed to come from the poor optimization of the load/store abstractions. I toyed with its flags a bit but had no luck. 😞
CI failures are relevant. Easiest fix is probably to move the test to |
I tried adding the same load/store helpers to |
This is an attempt at adding support to `load_le` and/or `store_be` for custom types. Essentially, any custom type can implement adapter methods `static T::load_{be/le} -> T` and `T::store_{be/le} -> void` to hook into this. This is essentially the same concept as we established with `_const_time_poison()`.

By implementing this for `SIMD_4x32`, combining this with the proposed `BufferTransformer` and a little bit of glue, I ended up with this for the AES-NI-128 decrypt function (that provides the same functionality as the original implementation):

When pulling the `rounds` variable into a template parameter we might even be able to share this implementation for the other AES key lengths.

Mini-Benchmark

The generated assembly is exactly the same (for clang 18). GCC 13 is doing an equally good job. No perceivable slowdown on either compiler. MSVC doesn't seem to care to unroll the statically sized for-loops (and perhaps other things) and in particular seems to have trouble seeing through the load/store magic, resulting in an "amazing" 8x slowdown. 😒 This is really sad!