
Performance and AMX utilization questions with optimum-intel 1.17 and 1.20 for LLM Inference on SPR CPU #946

zsym-sjtu opened this issue Oct 12, 2024 · 1 comment
zsym-sjtu commented Oct 12, 2024

I am using AMX on an SPR (Sapphire Rapids) CPU to test LLM inference performance, and I have questions about the performance of the different optimum.intel interfaces.

I am testing with both optimum-intel==1.17.0 and optimum-intel==1.20.0, together with intel_extension_for_pytorch==2.4.0, torch==2.4.1, and transformers==4.41.2.

I am testing llama-2-7b inference with BF16 precision.

As recommended in #942, I switched from `from optimum.intel import inference_mode` to `from optimum.intel.pipelines import pipeline`, and then compared their performance in the environment above. Results (avg. inference time) are below; `base` means without optimum-intel and `opt` means with it.

| optimum-intel==1.17 | inference_mode (to be removed) | optimum.intel.pipelines (recommended) |
| --- | --- | --- |
| base | 17.52 | 17.58 |
| opt | 17.31 | 1.00 |

| optimum-intel==1.20 | inference_mode (removed) | optimum.intel.pipelines (recommended) |
| --- | --- | --- |
| base | N/A | 17.58 |
| opt | N/A | 0.49 |

With optimum-intel 1.17, I can observe the oneDNN primitives being executed via `DNNL_VERBOSE`, and control the ISA (and thereby whether AMX is used) via `DNNL_MAX_CPU_ISA`, under both the base and opt setups.
With optimum-intel 1.20, I can still do so in base (native transformers), but not in opt (optimum.intel.pipelines). This indicates that optimum-intel 1.17 goes through oneDNN while optimum-intel 1.20 does not. (How these knobs are set is sketched below.)
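
For context, oneDNN reads both knobs from the environment, so they have to be in place before the library initializes. A minimal sketch of setting them from Python (they can equally be exported in the shell); the ISA value shown is just one example from oneDNN's documented set:

```python
import os

# Set oneDNN controls before torch / the model are loaded, so oneDNN picks them up at init.
os.environ["DNNL_VERBOSE"] = "1"                      # print executed oneDNN primitives
os.environ["DNNL_MAX_CPU_ISA"] = "AVX512_CORE_BF16"   # example: cap the ISA below AMX

import torch
from transformers import pipeline
```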

I learned that optimum-intel is partly based on intel-extension-for-pytorch (IPEX), and that IPEX can use libxsmm to drive AMX (see intel/intel-extension-for-pytorch#517 and intel/intel-extension-for-pytorch#720). A quick check for AMX availability on the host is sketched below.
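
This check only shows that the CPU exposes AMX; it does not show which library (oneDNN, libxsmm, ...) actually dispatches AMX kernels:

```python
# Availability check only: SPR exposes amx_tile / amx_bf16 / amx_int8 in its CPU flags.
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()

print({name: name in flags for name in ("amx_tile", "amx_bf16", "amx_int8")})
```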

My questions are:

  1. In optimum-intel 1.17, do inference_mode and optimum.intel.pipelines apply different optimizations? What's the difference?
  2. It seems that the native transformers 4.41.2 is also using AMX (with oneDNN). How does optimum-intel 1.17's optimum.intel.pipelines better utilize AMX?
  3. It seems that optimum-intel 1.20 doesn't use AMX with oneDNN. Then does it use AMX? If so, what does it use AMX with? libxsmm?
  4. How can I verify that AMX is used by optimum-intel 1.20 and the corresponding native transformer?
  5. As oneDNN is not used, how can I control whether to use AMX and observe the primitives in another library? Any substitute for DNNL_MAX_CPU_ISA?

Many many thanks!

Code as follows:

```python
# inference_mode (to be removed)
import sys

import torch
from transformers import pipeline
from optimum.intel import inference_mode

model_id = "meta-llama/Llama-2-7b-hf"  # llama-2-7b (assumed checkpoint id)

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
)

if sys.argv[1] == "base":
    result = benchmark(pipe)  # benchmark() is the timing helper (sketched below)
elif sys.argv[1] == "opt":
    # JIT-trace and optimize the pipeline's model via optimum-intel's inference_mode
    with inference_mode(pipe, dtype=torch.bfloat16, jit=True) as opt_pipe:
        result = benchmark(opt_pipe)
```

```python
# optimum.intel.pipelines (recommended)
import sys

import torch
from transformers.pipelines import pipeline as transformers_pipeline
from optimum.intel.pipelines import pipeline as ipex_pipeline

model_id = "meta-llama/Llama-2-7b-hf"  # llama-2-7b (assumed checkpoint id)

if sys.argv[1] == "base":
    pipe = transformers_pipeline("text-generation", model_id, torch_dtype=torch.bfloat16)
elif sys.argv[1] == "opt":
    # accelerator="ipex" loads the model through the IPEX-optimized model classes
    pipe = ipex_pipeline("text-generation", model_id, accelerator="ipex", torch_dtype=torch.bfloat16)

with torch.inference_mode():
    result = benchmark(pipe)
```
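
The benchmark() helper is not shown above; for completeness, a minimal sketch of the kind of timing loop assumed (prompt, warmup, iteration count, and generation length are placeholders, not the settings behind the numbers in the table):

```python
import time

def benchmark(pipe, prompt="Explain AMX in one paragraph.", warmup=2, iters=10, max_new_tokens=128):
    """Hypothetical timing helper: average wall-clock seconds per text-generation call."""
    for _ in range(warmup):  # warm-up runs (JIT tracing, weight packing, caches)
        pipe(prompt, max_new_tokens=max_new_tokens)
    start = time.perf_counter()
    for _ in range(iters):
        pipe(prompt, max_new_tokens=max_new_tokens)
    return (time.perf_counter() - start) / iters
```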
@yao-matrix

  1. In optimum-intel 1.17, do inference_mode and optimum.intel.pipelines apply different optimizations? What's the difference?
    [Response] inference_mode mainly JIT-traces the model, so it leverages compilation to fuse ops and gain performance. We felt that usage experience was not aligned with vanilla transformers code, so we changed to IPEXModel. In 1.20, IPEXModel inherits all the optimizations from inference_mode and additionally integrates IPEX custom ops to further improve performance (a usage sketch follows this list).

  2. It seems that the native transformers 4.41.2 is also using AMX (with oneDNN). How does optimum-intel 1.17's optimum.intel.pipelines better utilize AMX?
    [Response] The extra performance comes from two places: (1) graph fusion from the JIT, and (2) IPEX custom ops, which have better AMX efficiency than stock PyTorch's kernels (which is what transformers uses).

  3. It seems that optimum-intel 1.20 doesn't use AMX with oneDNN. Then does it use AMX? If so, what does it use AMX with? libxsmm?
    [Response] Yes, it still uses AMX. Whether it goes through oneDNN or libxsmm depends on IPEX's dispatch strategy; the principle is to pick whichever performs better.

  4. How can I verify that AMX is used by optimum-intel 1.20 and the corresponding native transformer?
    [Response] You can use turbostat to check the CPU's runtime frequency: when AMX is used, the frequency will be lower than when it is not. Put simply, both are using AMX; the difference is that some IPEX custom ops have better efficiency.

  5. As oneDNN is not used, how can I control whether to use AMX and observe the primitives in another library? Any substitute for DNNL_MAX_CPU_ISA?
    [Response] Maybe you can try LIBXSMM_TARGET.
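
To make the IPEXModel path from answer 1 concrete, a minimal sketch of using IPEXModelForCausalLM directly with a transformers pipeline (the checkpoint id is assumed, and torch_dtype is assumed to be forwarded as in the transformers API):

```python
import torch
from transformers import AutoTokenizer, pipeline
from optimum.intel import IPEXModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"  # assumed llama-2-7b checkpoint id

# IPEXModelForCausalLM applies the IPEX optimizations (custom ops, fused kernels) at load time;
# optimum.intel.pipelines with accelerator="ipex" appears to build on the same model class.
model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AMX on Sapphire Rapids", max_new_tokens=32)[0]["generated_text"])
```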
