This is a simple example showing SIMD gains. I tried to use SIMD to improve the performance of the system I am currently working on, but found that my platform supports many different instruction sets. As a newcomer, I could not tell which one was the right choice, so I decided to simply benchmark the different instruction sets on the workloads I need and pick the best one.
You can find introductions to the SIMD concept here and here. Basically, SIMD stands for Single Instruction Multiple Data, meaning a single instruction can operate on multiple pairs of data at once. It suits workloads that run a for or while loop with no dependencies between the target data. SIMD is hardware-specific; here we focus only on Intel CPUs.
Most modern CPUs include SIMD instructions, and we can use them from programming languages. These instructions operate on registers. On Intel CPUs there are two instruction-set families, SSE and AVX, each with 16 registers (in 64-bit mode). SSE uses the 128-bit registers XMM0-XMM15, each holding four 32-bit single-precision floating-point numbers, while AVX uses the 256-bit registers YMM0-YMM15, each holding eight 32-bit single-precision or four 64-bit double-precision floating-point numbers. Both families come in several instruction-set versions:
- SSE
  - single-precision vectors
  - 60 instructions
  - released in 1999
- SSE2
  - double-precision vectors
  - extends SSE with 128-bit vectors of 8/16/32/64-bit integers
  - 144 instructions
  - released in 2000
- SSE3
  - 32 instructions
  - adds support for complex arithmetic
  - released in 2004
- SSE4.1
  - adds video/graphics building blocks
  - released in 2007
- SSE4.2
  - adds string-processing instructions
  - released in 2008
- AVX
  - widens floating-point vectors to 256 bits
  - released in 2011
- AVX2
  - extends the 256-bit vectors to integer operations
  - arrived together with fused multiply-add (FMA) instructions
  - released in 2013
  - all CPUs since AMD Carrizo or Intel Haswell support AVX2
- AVX-512 series
  - expands AVX to 512-bit vectors
  - announced in 2013
- To check which SIMD instruction sets your CPU supports, check the CPU flags
  - by running lscpu or cat /proc/cpuinfo
  - check whether the flags include the keywords sse, sse2, sse3, sse4_1, sse4_2, avx, avx2, or the avx512 prefix
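If you would rather check from code, gcc and clang provide the `__builtin_cpu_supports` builtin; below is a minimal sketch (note the builtin's feature strings, e.g. `sse4.1` and `avx512f`, differ slightly from the /proc/cpuinfo flag names):

```c
#include <stdio.h>

int main(void) {
    /* __builtin_cpu_supports returns nonzero if the running CPU
       has the feature; it requires gcc or clang. */
    printf("sse:     %s\n", __builtin_cpu_supports("sse")     ? "yes" : "no");
    printf("sse2:    %s\n", __builtin_cpu_supports("sse2")    ? "yes" : "no");
    printf("sse3:    %s\n", __builtin_cpu_supports("sse3")    ? "yes" : "no");
    printf("sse4.1:  %s\n", __builtin_cpu_supports("sse4.1")  ? "yes" : "no");
    printf("sse4.2:  %s\n", __builtin_cpu_supports("sse4.2")  ? "yes" : "no");
    printf("avx:     %s\n", __builtin_cpu_supports("avx")     ? "yes" : "no");
    printf("avx2:    %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("fma:     %s\n", __builtin_cpu_supports("fma")     ? "yes" : "no");
    printf("avx512f: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}
```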
- For the full list of APIs you can use, check here
- Understanding the APIs
_mm256_add_epi32(__m256i a, __m256i b)
  - means: operate on 256-bit vectors, doing addition, where the contents of the vectors are interpreted as 32-bit integers
  - other APIs follow the same naming scheme; you can check the usage of each API in the link above, and a short example follows below
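To make the naming concrete, here is a minimal, self-contained use of `_mm256_add_epi32`; the surrounding `_mm256_setr_epi32`/`_mm256_storeu_si256` calls are just one way to move data in and out of the vector registers:

```c
#include <stdio.h>
#include <immintrin.h>

int main(void) {
    /* Pack eight 32-bit integers into each 256-bit vector. */
    __m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i b = _mm256_setr_epi32(10, 20, 30, 40, 50, 60, 70, 80);

    /* One instruction adds all eight lane pairs at once. */
    __m256i c = _mm256_add_epi32(a, b);

    int out[8];
    _mm256_storeu_si256((__m256i *)out, c);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);  /* prints: 10 21 32 43 54 65 76 87 */
    printf("\n");
    return 0;
}
```

Compile with the -mavx2 flag (see the compiler section below).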
- header files
  - for the SSE series you only need
#include <nmmintrin.h>
  - for the AVX series you only need
#include <immintrin.h>
- compiler
  - add the compiler flag (an example command follows below)
    - instruction sets before AVX need no flag
    - -mavx for AVX
    - -mavx2 for AVX2
    - -mfma for FMA
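For example, building the float benchmark below with AVX2 and FMA enabled might look like this (the gcc invocation and -O2 level are my assumptions, not requirements):

gcc -O2 -mavx2 -mfma test_float.c -o test_float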
- Target instruction sets
  - our system does not support AVX-512, so we only care about the following sets:
    - AVX for double-based and float-based computation
    - AVX2 for integer-based computation
    - FMA for fused multiply-add operations
- in our system, we only care about two types of workload (sketched in the code below)
  - array add: element-wise addition of two arrays, that is
a[i] + b[i]
  - array multiply-and-add: element-wise multiplication of two arrays, followed by element-wise addition of the products with a third array, that is
a[i]*b[i] + c[i]
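In scalar form the two workloads are just the following loops; this is a sketch with illustrative signatures (the function names match the test files described below, but the exact parameters are my assumption):

```c
/* Scalar baseline for array add: out[i] = a[i] + b[i]. */
void add_baseline(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Scalar baseline for array multiply-and-add: out[i] = a[i]*b[i] + c[i]. */
void addmul_baseline(const float *a, const float *b, const float *c,
                     float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] * b[i] + c[i];
}
```

Note that each iteration is independent of the others, which is exactly what makes these loops SIMD-friendly.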
- Source code
  - test_double.c implements the baseline and SIMD versions based on the double type
  - test_float.c implements the baseline and SIMD versions based on the float type
  - test_int.c implements the baseline and SIMD versions based on the integer type
  - inside each .c file:
add_baseline()
    - scalar processing for array add
add_avx()
    - AVX instructions for array add
addmul_baseline()
    - scalar processing for array multiply-and-add
addmul_avx()
    - AVX instructions for array multiply-and-add
addmul_fma()
    - FMA instructions for array multiply-and-add
  - a sketch of the vectorized versions follows below
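For reference, here is roughly what the vectorized float versions look like. This is a sketch that assumes n is a multiple of 8 and uses unaligned loads, so the actual test_float.c may differ in details:

```c
#include <immintrin.h>

/* AVX version of array add: processes 8 floats per iteration. */
void add_avx(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
}

/* FMA version of multiply-and-add: one fused instruction
   computes a[i]*b[i] + c[i] for 8 lanes at a time. */
void addmul_fma(const float *a, const float *b, const float *c,
                float *out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(out + i, _mm256_fmadd_ps(va, vb, vc));
    }
}
```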
We repeat each function 10,000,000 times, vary the array size, and report the performance results.
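The timing harness is roughly the sketch below; REPEAT and the function-pointer interface are illustrative rather than the exact code in the test files:

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

#define REPEAT 10000000L

/* Times REPEAT back-to-back calls of fn on a monotonic clock
   and returns the elapsed wall-clock time in seconds. */
double time_repeated(void (*fn)(void)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long r = 0; r < REPEAT; r++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```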
double-based implementation

- array size = 10

| methods | used time | speedup |
| --- | --- | --- |
| add_base | 2.83s | 1.0 |
| add_avx | 2.22s | 1.27 |
| addmul_base | 3.10s | 1.0 |
| addmul_avx | 3.09s | 1.004 |
| addmul_fma | 2.84s | 1.09 |

- array size = 24

| methods | used time | speedup |
| --- | --- | --- |
| add_base | 7.13s | 1.0 |
| add_avx | 4.55s | 1.56 |
| addmul_base | 6.97s | 1.0 |
| addmul_avx | 7.07s | 0.98 |
| addmul_fma | 6.20s | 1.12 |

- array size = 100

| methods | used time | speedup |
| --- | --- | --- |
| add_base | 24.9s | 1.0 |
| add_avx | 13.6s | 1.82 |
| addmul_base | 25.4s | 1.0 |
| addmul_avx | 23.2s | 1.1 |
| addmul_fma | 19.9s | 1.27 |
float-based implementation

- array size = 10

| methods | used time | speedup |
| --- | --- | --- |
| add_base | 2.37s | 1.0 |
| add_avx | 1.27s | 1.86 |
| addmul_base | 2.87s | 1.0 |
| addmul_avx | 2.02s | 1.42 |
| addmul_fma | 1.87s | 1.53 |

- array size = 24

| methods | used time | speedup |
| --- | --- | --- |
| add_base | 7.07s | 1.0 |
| add_avx | 2.23s | 3.18 |
| addmul_base | 7.69s | 1.0 |
| addmul_avx | 4.64s | 1.65 |
| addmul_fma | 4.32s | 1.78 |

- array size = 100

| methods | used time | speedup |
| --- | --- | --- |
| add_base | 24.8s | 1.0 |
| add_avx | 7.98s | 3.11 |
| addmul_base | 26.3s | 1.0 |
| addmul_avx | 17.6s | 1.49 |
| addmul_fma | 15.9s | 1.64 |
int-based implementation

- array size = 10

| methods | used time | speedup |
| --- | --- | --- |
| add_base | 2.42s | 1.0 |
| add_avx | 1.36s | 1.77 |
| addmul_base | 2.67s | 1.0 |
| addmul_avx | 1.87s | 1.42 |

- array size = 24

| methods | used time | speedup |
| --- | --- | --- |
| add_base | 6.96s | 1.0 |
| add_avx | 2.83s | 2.45 |
| addmul_base | 7.02s | 1.0 |
| addmul_avx | 4.48s | 1.57 |

- array size = 100

| methods | used time | speedup |
| --- | --- | --- |
| add_base | 24.7s | 1.0 |
| add_avx | 9.21s | 2.68 |
| addmul_base | 24.8s | 1.0 |
| addmul_avx | 15.24s | 1.62 |
Based on the evaluation results, we draw the following conclusions:

- the float-based and int-based implementations achieve almost the same speedup, which means the speedup really depends on how many elements fit in a vector (eight 32-bit values per 256-bit register)
- the double-based implementation shows limited improvement: in theory up to 4x (four doubles per 256-bit vector), but in practice only 1.27-1.82x for addition and about 10-27% for multiply-and-add
- FMA indeed outperforms plain AVX for fused multiply-add operations
- the float-based implementation is the best choice
  - up to 3.18x faster for vector addition
  - up to 78% faster using FMA for multiply-and-add operations