Conversation
LGTM
I believe @zhewang1-intc implemented gemm+layernorm fusion before; is this still necessary?
@@ -4,7 +4,7 @@ project(bestla LANGUAGES CXX VERSION 0.1.0)
 file(GLOB headers ${PROJECT_NAME}/*.h ${PROJECT_NAME}/*.hpp)
 file(GLOB xbyak_headers ${PROJECT_NAME}/xbyak/*.h ${PROJECT_NAME}/xbyak/*.hpp)

-option(BTLA_USE_OPENMP "Enable OpenMP thread pool" ON)
+option(BTLA_USE_OPENMP "Enable OpenMP thread pool" OFF)
We already have a customized thread pool implemented?
It's better not to set it to ON by default. It can be set in neural_speed, which uses OMP by default.
Yes, it follows ONNX's definition: it can have scale and bias.
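For context, ONNX's LayerNormalization normalizes each row and then applies an optional per-element scale (gamma) and bias (beta). A minimal scalar sketch of that definition; the function name and signature are hypothetical, not the BesTLA API:

```cpp
#include <cmath>
#include <cstddef>

// Reference (scalar) layer norm over one row of n floats:
// y = (x - mean) / sqrt(var + eps), optionally scaled and shifted.
void layernorm_ref(const float* x, const float* scale, const float* bias,
                   float* y, size_t n, float eps = 1e-5f) {
  float mean = 0.f, var = 0.f;
  for (size_t i = 0; i < n; ++i) mean += x[i];
  mean /= n;
  for (size_t i = 0; i < n; ++i) var += (x[i] - mean) * (x[i] - mean);
  var /= n;
  const float inv_std = 1.0f / std::sqrt(var + eps);
  for (size_t i = 0; i < n; ++i) {
    float v = (x[i] - mean) * inv_std;
    if (scale) v *= scale[i];  // optional per-element scale (gamma)
    if (bias)  v += bias[i];   // optional per-element bias (beta)
    y[i] = v;
  }
}
```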
Type of Change
Implement AVX2 and AVX512F layer normalization in BesTLA (an illustrative AVX2 sketch follows the perf numbers below).
Use the BesTLA kernel when src0 is contiguous.
~3x speedup.
Before:
GPTJ: perf_total_per_op_us[ NORM] = 0.477 ms
LLAMA2: perf_total_per_op_us[ RMS_NORM] = 0.362 ms
After:
GPTJ: perf_total_per_op_us[ NORM] = 0.055 ms
LLAMA2: perf_total_per_op_us[ RMS_NORM] = 0.117 ms
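For readers unfamiliar with the vectorization, below is an illustrative single-row AVX2 layer-norm sketch. It is not the actual BesTLA kernel: the function name, signature, and three-pass structure are assumptions, and the real code only takes this path when src0 is contiguous, otherwise it keeps the existing scalar path. Compile with -mavx2 -mfma.

```cpp
#include <immintrin.h>
#include <cmath>
#include <cstddef>

// Horizontal sum of a 256-bit float vector.
static inline float hsum256_ps(__m256 v) {
  __m128 lo = _mm256_castps256_ps128(v);
  __m128 hi = _mm256_extractf128_ps(v, 1);
  lo = _mm_add_ps(lo, hi);
  lo = _mm_hadd_ps(lo, lo);
  lo = _mm_hadd_ps(lo, lo);
  return _mm_cvtss_f32(lo);
}

// Normalize one contiguous row of n floats in place:
// x = (x - mean) / sqrt(var + eps) * scale + bias
void layernorm_avx2_row(float* x, const float* scale, const float* bias,
                        size_t n, float eps) {
  // Pass 1: mean.
  __m256 vsum = _mm256_setzero_ps();
  size_t i = 0;
  for (; i + 8 <= n; i += 8) vsum = _mm256_add_ps(vsum, _mm256_loadu_ps(x + i));
  float sum = hsum256_ps(vsum);
  for (; i < n; ++i) sum += x[i];
  const float mean = sum / n;

  // Pass 2: variance.
  const __m256 vmean = _mm256_set1_ps(mean);
  __m256 vvar = _mm256_setzero_ps();
  for (i = 0; i + 8 <= n; i += 8) {
    __m256 d = _mm256_sub_ps(_mm256_loadu_ps(x + i), vmean);
    vvar = _mm256_fmadd_ps(d, d, vvar);  // requires FMA
  }
  float var = hsum256_ps(vvar);
  for (; i < n; ++i) { float d = x[i] - mean; var += d * d; }
  const float inv_std = 1.0f / std::sqrt(var / n + eps);

  // Pass 3: normalize, then apply optional scale and bias (ONNX definition).
  const __m256 vinv = _mm256_set1_ps(inv_std);
  for (i = 0; i + 8 <= n; i += 8) {
    __m256 y = _mm256_mul_ps(_mm256_sub_ps(_mm256_loadu_ps(x + i), vmean), vinv);
    if (scale) y = _mm256_mul_ps(y, _mm256_loadu_ps(scale + i));
    if (bias)  y = _mm256_add_ps(y, _mm256_loadu_ps(bias + i));
    _mm256_storeu_ps(x + i, y);
  }
  for (; i < n; ++i) {
    float y = (x[i] - mean) * inv_std;
    if (scale) y *= scale[i];
    if (bias)  y += bias[i];
    x[i] = y;
  }
}
```

An AVX512F variant follows the same structure with 512-bit vectors, processing 16 floats per step instead of 8.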