[CPU] Support dynamic activation sparsity #27974

Open · wants to merge 37 commits into base: master
Conversation

@usstq (Contributor) commented Dec 9, 2024

Details:

Activation sparsity exploits the fact that activations in the MLP layers of LLMs are sparse: input channels whose activation magnitude is small can be set to zero with an acceptable accuracy drop.

The distribution of sparse activation channels is dynamic (known only at runtime) and varies a lot from token to token, so the optimization opportunity exists only in the 2nd-token generation phase with batch size fixed to 1 (which is exactly the typical use case for client-side LLM inference); in that case the weight-memory reads corresponding to the skipped input channels can be saved.
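The per-token decision can be sketched as follows (an illustrative magnitude-threshold mask; the function name and shapes are hypothetical, not the PR's actual code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: mark input channels whose activation magnitude is
// below a threshold as sparse. Those channels are treated as zero, and the
// weight rows they multiply can be skipped during 2nd-token generation.
std::vector<bool> sparse_channel_mask(const std::vector<float>& act, float threshold) {
    std::vector<bool> is_sparse(act.size());
    for (size_t ic = 0; ic < act.size(); ++ic)
        is_sparse[ic] = std::fabs(act[ic]) < threshold;
    return is_sparse;
}
```

Because the mask depends on the actual activation values, it must be recomputed for every token, which is why the sparsity pattern cannot be baked into the weights offline.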

The best weight-memory layout for this optimization is plain [IC, OC], so the weights for each input channel are dense, and non-sparse input channels benefit from the CPU hardware prefetcher's boost for continuous streaming access. If we used the current blocked weight layout set by the oneDNN fork, the weights from non-sparse and sparse channels would be mixed together within cache lines, which would hurt performance both because of the prefetcher-unfriendly access pattern and because of DDR physical page granularity.
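A minimal sketch of why the plain layout helps (a hypothetical scalar GEMV, not the PR's JIT kernel): with row-major [IC, OC] weights, skipping a sparse input channel skips one whole contiguous weight row, and the surviving rows are read as dense sequential streams.

```cpp
#include <cstddef>

// Illustrative only: with plain [IC, OC] row-major weights, each input
// channel owns one contiguous row, so a skipped channel never touches its
// row at all, and the remaining rows are streamed sequentially.
void gemv_skip_sparse(const float* act,     // [IC]
                      const float* weight,  // [IC, OC], row-major
                      float* out,           // [OC]
                      size_t IC, size_t OC,
                      float threshold) {
    for (size_t oc = 0; oc < OC; ++oc)
        out[oc] = 0.0f;
    for (size_t ic = 0; ic < IC; ++ic) {
        float a = act[ic];
        if (a < threshold && a > -threshold)
            continue;  // whole weight row [ic, :] is never read
        const float* wrow = weight + ic * OC;
        for (size_t oc = 0; oc < OC; ++oc)
            out[oc] += a * wrow[oc];
    }
}
```

In a blocked layout the weights of one input channel are scattered across many cache-line-sized blocks, so the same skip saves far less memory traffic.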

But choosing the plain [IC, OC] layout poses a challenge for 1st-token latency, because the blocked layout is best for the compute-bound 1st-token case; so in this PR we also have to minimize the degradation of 1st-token latency.

Performance data on i9-13900K

| weight format | PHY cores | prompt length | master, standard IR (1st/2nd latency) | this PR, sparse IR (--up 0.32 --gate 0.32 --down 0.52) (1st/2nd latency) |
|---|---|---|---|---|
| INT8_ASYM | 2 | 32 | 2014 ms / 226 ms | 1830 ms / 190 ms |
| INT8_ASYM | 2 | 512 | 25620 ms / 228 ms | 24990 ms / 165 ms |
| INT8_ASYM | 4 | 32 | 1140 ms / 133 ms | 1030 ms / 104 ms |
| INT8_ASYM | 4 | 512 | 14675 ms / 138 ms | 13496 ms / 109 ms |
| INT8_ASYM | 8 | 32 | 768 ms / 116 ms | 752 ms / 91 ms |
| INT8_ASYM | 8 | 512 | 8756 ms / 138 ms | 8112 ms / 105 ms |
| INT4_ASYM | 2 | 32 | 2116 ms / 192 ms | 1689 ms / 139 ms |
| INT4_ASYM | 2 | 512 | 24980 ms / 196 ms | 26143 ms / 146 ms |
| INT4_ASYM | 4 | 32 | 1648 ms / 110 ms | 941 ms / 83 ms |
| INT4_ASYM | 4 | 512 | 14016 ms / 96 ms | 14201 ms / 87 ms |
| INT4_ASYM | 8 | 32 | 906 ms / 70 ms | 534 ms / 59 ms |
| INT4_ASYM | 8 | 512 | 7885 ms / 73 ms | 7896 ms / 62 ms |

We can see that there is no meaningful regression in 1st-token latency, and a ~20% reduction in 2nd-token latency.

SIMDJit

In this PR we introduce a new way of writing JIT kernels, an enhanced version of existing attempts at making JIT programming more friendly. Those efforts all focus on making xbyak-based JIT programming more user-friendly as an EDSL, and in this PR I go further in this direction:

  • shared_ptr-based automatic register life-cycle management;
  • C++ expression support for 64-bit scalar registers;
  • control-flow constructs such as if/for/while/do_while;
  • functional-style JIT kernels instead of subclassing.

In the future we can port it to ARM64 and RISC-V, and also try another level of abstraction over SIMD vector registers to (maybe) make a single kernel work on all CPU platforms with much less effort (in terms of implementation and porting).
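The shared_ptr-based life-cycle idea can be illustrated in plain C++ (a hypothetical RegPool, not SIMDJit's actual API): a custom deleter returns the physical register to the free pool when the last handle goes out of scope, so registers are released by ordinary C++ scoping rules rather than manual bookkeeping.

```cpp
#include <memory>
#include <vector>

// Hypothetical register pool: alloc() hands out a shared_ptr whose deleter
// recycles the physical register index, so the register is freed
// automatically when the last copy of the handle is destroyed.
class RegPool {
public:
    using Reg = std::shared_ptr<const int>;

    explicit RegPool(int n) {
        for (int r = n - 1; r >= 0; --r)
            free_.push_back(r);
    }

    Reg alloc() {
        int r = free_.back();
        free_.pop_back();
        return Reg(new int(r), [this](const int* p) {
            free_.push_back(*p);  // deleter: return register to the pool
            delete p;
        });
    }

    size_t available() const { return free_.size(); }

private:
    std::vector<int> free_;
};
```

Copying a handle shares ownership, so a register stays live exactly as long as some part of the kernel-building code still refers to it.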

Tickets:

@github-actions github-actions bot added category: CPU OpenVINO CPU plugin category: build OpenVINO cmake script / infra labels Dec 9, 2024
@github-actions github-actions bot removed the category: build OpenVINO cmake script / infra label Dec 18, 2024
@usstq usstq marked this pull request as ready for review December 19, 2024 13:38
@usstq usstq requested review from a team as code owners December 19, 2024 13:38
@usstq usstq requested a review from luo-cheng2021 December 27, 2024 01:36
@luo-cheng2021 (Contributor) left a comment:
The new functionality is really cool for simplifying JIT code: automatic physical-register mapping, simulated C-like control flow, expressions for general-purpose registers. Maybe these features could be extended to SIMD registers in the future; then we could write JIT code just like intrinsics!

MemoryPtr m_scales;
ActSparseFCNode::Config& m_config;

void show(const char* name, uint8_t* src, int stride, int rows, int cols) {
Contributor: To be removed.

Contributor Author: Sure, will remove.

template <class T>
T* ActSparseFcKernel::scratch_alloc(size_t cnt) {
# if defined(__GNUC__) || defined(__clang__)
thread_local uint8_t scratch[1024 * 1024 * 2] __attribute__((aligned(4096)));
Contributor: We'd better use the scratch in the GraphContext.

Contributor Author: Changed.

int OC,
int n0,
int n1) {
static auto repack_2xsimdw = jit_compile_repack_2xsimdw(WeightCompressionType::FP16);
Contributor: We'd better add this to the primitive cache.

Contributor Author: Added primitive cache support.

auto jit = std::make_shared<SIMDJit>(__func__);
auto simd_width = SIMDJit::vmm_width<float>();

auto zp_input_u8 = jit->get_sreg(0);
Contributor: The input parameters of a JIT function are different from normal variables; we'd better add a new function to get the input parameters.

Contributor Author: Sure, will add get_arg().

const float* scales,
const uint8_t* zp) {
const auto SIMDW = SIMDJit::vmm_width<float>();
if (OC % (2 * SIMDW)) {
Contributor: OPENVINO_ASSERT should be simpler.

Contributor Author: Sure, will change.

return jit;
}

static void gemm6x2_Mx2(const float* pA,
Contributor: What's the meaning of x2 in Mx2?

Contributor Author (Dec 28, 2024): x2 means 2 SIMD-register widths in units of fp32; for example, in the AVX2 case x2 means 16 fp32s, and in AVX512 it means 32 fp32s.
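As a quick arithmetic check of the naming (assuming the usual register widths: a 256-bit ymm holds 8 fp32, a 512-bit zmm holds 16):

```cpp
// "x2" = two SIMD registers' worth of fp32 lanes per iteration.
constexpr int avx2_simdw   = 256 / 32;        // 8 fp32 per ymm
constexpr int avx512_simdw = 512 / 32;        // 16 fp32 per zmm
constexpr int avx2_x2      = 2 * avx2_simdw;    // 16 fp32
constexpr int avx512_x2    = 2 * avx512_simdw;  // 32 fp32
```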

@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Dec 30, 2024