Unaligned loads perform well on x86_64
Recently I have been writing quite a lot of vectorized code, and I decided to update my very first attempt at it. The new version is certainly much simpler. I did a quick check and pointer-sized integer types are signed by default (at least on my platform, intptr_t is a long, not an unsigned long). So subtracting from end_ptr as in this code will simply work.
Daniel Lemire did a test and found there is no difference between unaligned and aligned loads: https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/. That was quite some time ago. I also did some reading lately and found it confirmed that AMD and Intel specifically altered their architectures to make sure unaligned loads are just as fast. Data alignment is simply not a speed issue anymore: the difference between the load instructions is not measurable. As a result, code that uses unaligned loads is actually faster overall, because it can start using vector instructions right away rather than paying for an alignment loop first.
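To illustrate the pattern described above, here is a minimal sketch of a loop that uses unaligned loads from the first byte, with no alignment prologue. It is not the code from this pull request: the function name `buffer_is_ascii` and the ASCII check itself are hypothetical, chosen only because they make a short example. It assumes SSE2 (always present on x86_64) and falls back to the scalar loop elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: returns 1 if every byte in buf[0..len) is below 0x80,
   0 otherwise. */
static int buffer_is_ascii(const uint8_t *buf, size_t len)
{
    const uint8_t *cursor = buf;
    const uint8_t *end_ptr = buf + len;
#if defined(__SSE2__)
    /* end_ptr - cursor is a signed ptrdiff_t, so this test is simply false
       when fewer than 16 bytes remain. No alignment loop is needed because
       _mm_loadu_si128 accepts any address. */
    while (end_ptr - cursor >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)cursor);
        /* movemask gathers the high bit of each of the 16 bytes. */
        if (_mm_movemask_epi8(chunk) != 0) {
            return 0;
        }
        cursor += 16;
    }
#endif
    /* Scalar tail for the remaining 0 to 15 bytes (or the whole buffer
       when SSE2 is not available). */
    while (cursor < end_ptr) {
        if (*cursor & 0x80) {
            return 0;
        }
        cursor++;
    }
    return 1;
}
```

The only scalar work left is the tail of at most 15 bytes, so an aligned variant would need an extra prologue loop without gaining anything on current x86_64 hardware.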
I did some quick testing and found no speed difference between this code and the old code, and the new version saves quite a few lines.