[CPU] Support dynamic activation sparsity #27974

Open · wants to merge 37 commits into base: master
Conversation

@usstq (Contributor) commented Dec 9, 2024

Details:

Activation sparsity exploits the fact that activations in the MLP layers of LLMs are sparse: input channels whose activation magnitude is small can be set to zero with an acceptable accuracy drop.

The distribution of sparse activation channels is dynamic (known only at runtime) and varies a lot from token to token, so the optimization opportunity exists only in the 2nd-token generation phase with batch size fixed to 1 (which is exactly the typical use case for client-side LLM inference); in that case the weight-memory reads corresponding to the skipped input channels can be saved.
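The per-token decision can be sketched as follows (an illustrative magnitude-threshold mask; the function name and shapes are hypothetical, not the PR's actual code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: mark input channels whose activation magnitude is
// below a threshold as sparse. Those channels are treated as zero, and the
// weight rows they multiply can be skipped during 2nd-token generation.
std::vector<bool> sparse_channel_mask(const std::vector<float>& act, float threshold) {
    std::vector<bool> is_sparse(act.size());
    for (size_t ic = 0; ic < act.size(); ++ic)
        is_sparse[ic] = std::fabs(act[ic]) < threshold;
    return is_sparse;
}
```

Because the mask depends on the actual activation values, it must be recomputed for every token, which is why the sparsity pattern cannot be baked into the weights offline.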

The best weight-memory layout for this optimization is plain [IC, OC], so the weights for each input channel are dense, and non-sparse input channels benefit from the CPU hardware prefetcher's boost for continuous streaming access. If we used the current blocked weight layout set by the oneDNN fork, the weights from non-sparse and sparse channels would be mixed together within cache lines, which would hurt performance both because of the prefetcher-unfriendly access pattern and because of DDR physical page granularity.
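A minimal sketch of why the plain layout helps (a hypothetical scalar GEMV, not the PR's JIT kernel): with row-major [IC, OC] weights, skipping a sparse input channel skips one whole contiguous weight row, and the surviving rows are read as dense sequential streams.

```cpp
#include <cstddef>

// Illustrative only: with plain [IC, OC] row-major weights, each input
// channel owns one contiguous row, so a skipped channel never touches its
// row at all, and the remaining rows are streamed sequentially.
void gemv_skip_sparse(const float* act,     // [IC]
                      const float* weight,  // [IC, OC], row-major
                      float* out,           // [OC]
                      size_t IC, size_t OC,
                      float threshold) {
    for (size_t oc = 0; oc < OC; ++oc)
        out[oc] = 0.0f;
    for (size_t ic = 0; ic < IC; ++ic) {
        float a = act[ic];
        if (a < threshold && a > -threshold)
            continue;  // whole weight row [ic, :] is never read
        const float* wrow = weight + ic * OC;
        for (size_t oc = 0; oc < OC; ++oc)
            out[oc] += a * wrow[oc];
    }
}
```

In a blocked layout the weights of one input channel are scattered across many cache-line-sized blocks, so the same skip saves far less memory traffic.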

But choosing the plain [IC, OC] layout poses a challenge for 1st-token latency, because the blocked layout is best for the compute-bound 1st-token case; so in this PR we also have to minimize the degradation of 1st-token latency.

Performance data on i9-13900K

| weight format | PHY cores | prompt length | master, standard IR (1st/2nd latency) | this PR, sparse IR (--up 0.32 --gate 0.32 --down 0.52) (1st/2nd latency) |
|---|---|---|---|---|
| INT8_ASYM | 2 | 32 | 2014 ms / 226 ms | 1830 ms / 190 ms |
| INT8_ASYM | 2 | 512 | 25620 ms / 228 ms | 24990 ms / 165 ms |
| INT8_ASYM | 4 | 32 | 1140 ms / 133 ms | 1030 ms / 104 ms |
| INT8_ASYM | 4 | 512 | 14675 ms / 138 ms | 13496 ms / 109 ms |
| INT8_ASYM | 8 | 32 | 768 ms / 116 ms | 752 ms / 91 ms |
| INT8_ASYM | 8 | 512 | 8756 ms / 138 ms | 8112 ms / 105 ms |
| INT4_ASYM | 2 | 32 | 2116 ms / 192 ms | 1689 ms / 139 ms |
| INT4_ASYM | 2 | 512 | 24980 ms / 196 ms | 26143 ms / 146 ms |
| INT4_ASYM | 4 | 32 | 1648 ms / 110 ms | 941 ms / 83 ms |
| INT4_ASYM | 4 | 512 | 14016 ms / 96 ms | 14201 ms / 87 ms |
| INT4_ASYM | 8 | 32 | 906 ms / 70 ms | 534 ms / 59 ms |
| INT4_ASYM | 8 | 512 | 7885 ms / 73 ms | 7896 ms / 62 ms |

We can see that there is no meaningful regression in 1st-token latency, and a ~20% reduction in 2nd-token latency.

SIMDJit

In this PR we introduce a new way of writing JIT kernels, an enhanced version of existing attempts at making JIT programming more friendly. Those efforts all focus on making xbyak-based JIT programming more user-friendly as an EDSL, and in this PR I go further in this direction:

  • shared_ptr-based automatic register life-cycle management;
  • C++ expression support for 64-bit scalar registers;
  • control-flow constructs such as if/for/while/do_while;
  • functional-style JIT kernels instead of subclassing.

In the future we can port it to ARM64 and RISC-V, and also try another level of abstraction over SIMD vector registers to (maybe) make a single kernel work on all CPU platforms with much less effort (in terms of implementation and porting).
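The shared_ptr-based life-cycle idea can be illustrated in plain C++ (a hypothetical RegPool, not SIMDJit's actual API): a custom deleter returns the physical register to the free pool when the last handle goes out of scope, so registers are released by ordinary C++ scoping rules rather than manual bookkeeping.

```cpp
#include <memory>
#include <vector>

// Hypothetical register pool: alloc() hands out a shared_ptr whose deleter
// recycles the physical register index, so the register is freed
// automatically when the last copy of the handle is destroyed.
class RegPool {
public:
    using Reg = std::shared_ptr<const int>;

    explicit RegPool(int n) {
        for (int r = n - 1; r >= 0; --r)
            free_.push_back(r);
    }

    Reg alloc() {
        int r = free_.back();
        free_.pop_back();
        return Reg(new int(r), [this](const int* p) {
            free_.push_back(*p);  // deleter: return register to the pool
            delete p;
        });
    }

    size_t available() const { return free_.size(); }

private:
    std::vector<int> free_;
};
```

Copying a handle shares ownership, so a register stays live exactly as long as some part of the kernel-building code still refers to it.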

Tickets:

@github-actions github-actions bot added category: CPU OpenVINO CPU plugin category: build OpenVINO cmake script / infra labels Dec 9, 2024
@github-actions github-actions bot removed the category: build OpenVINO cmake script / infra label Dec 18, 2024
@usstq usstq marked this pull request as ready for review December 19, 2024 13:38
@usstq usstq requested review from a team as code owners December 19, 2024 13:38
@usstq usstq requested a review from luo-cheng2021 December 27, 2024 01:36
@luo-cheng2021 (Contributor) left a comment:
The new functionality is really cool for simplifying JIT code: automatic physical-register mapping, simulated C-like control flow, expressions for general-purpose registers. Maybe these features could be extended to SIMD registers in the future; then we could write JIT code just like intrinsics!

MemoryPtr m_scales;
ActSparseFCNode::Config& m_config;

void show(const char* name, uint8_t* src, int stride, int rows, int cols) {
Contributor: To be removed.

Contributor Author: Sure, will remove.

template <class T>
T* ActSparseFcKernel::scratch_alloc(size_t cnt) {
# if defined(__GNUC__) || defined(__clang__)
thread_local uint8_t scratch[1024 * 1024 * 2] __attribute__((aligned(4096)));
Contributor: We'd better use the scratch in the GraphContext.

Contributor Author: Changed.

int OC,
int n0,
int n1) {
static auto repack_2xsimdw = jit_compile_repack_2xsimdw(WeightCompressionType::FP16);
Contributor: We'd better add this to the primitive cache.

Contributor Author: Added primitive cache support.

auto jit = std::make_shared<SIMDJit>(__func__);
auto simd_width = SIMDJit::vmm_width<float>();

auto zp_input_u8 = jit->get_sreg(0);
Contributor: The input parameters of a JIT function are different from normal variables; we'd better add a new function to get the input parameters.

Contributor Author: Sure, will add get_arg().

const float* scales,
const uint8_t* zp) {
const auto SIMDW = SIMDJit::vmm_width<float>();
if (OC % (2 * SIMDW)) {
Contributor: OPENVINO_ASSERT should be simpler.

Contributor Author: Sure, will change.

return jit;
}

static void gemm6x2_Mx2(const float* pA,
Contributor: What's the meaning of x2 in Mx2?

Contributor Author (Dec 28, 2024): x2 means 2 SIMD-register widths in units of fp32; for example, in the AVX2 case x2 means 16 fp32s, and in AVX512 it means 32 fp32s.
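As a quick arithmetic check of the naming (assuming the usual register widths: a 256-bit ymm holds 8 fp32, a 512-bit zmm holds 16):

```cpp
// "x2" = two SIMD registers' worth of fp32 lanes per iteration.
constexpr int avx2_simdw   = 256 / 32;        // 8 fp32 per ymm
constexpr int avx512_simdw = 512 / 32;        // 16 fp32 per zmm
constexpr int avx2_x2      = 2 * avx2_simdw;    // 16 fp32
constexpr int avx512_x2    = 2 * avx512_simdw;  // 32 fp32
```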

@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Dec 30, 2024