
Performance and AMX utilization questions with optimum-intel 1.17 and 1.20 for LLM Inference on SPR CPU #946

zsym-sjtu opened this issue Oct 12, 2024 · 1 comment
zsym-sjtu commented Oct 12, 2024

I am using AMX on an SPR (Sapphire Rapids) CPU to test LLM inference performance, and I have questions about the performance of the different optimum.intel interfaces.

I am testing with both optimum-intel==1.17.0 and optimum-intel==1.20.0, together with intel_extension_for_pytorch==2.4.0, torch==2.4.1, and transformers==4.41.2.

I am testing llama-2-7b inference with BF16 precision.

As recommended in #942, I switched from `from optimum.intel import inference_mode` to `from optimum.intel.pipelines import pipeline`, and then compared their performance in the environment above. Results (avg. inference time) are below; `base` means without optimum-intel and `opt` means with it.

| optimum-intel==1.17 | inference_mode (to be removed) | optimum.intel.pipelines (recommended) |
| --- | --- | --- |
| base | 17.52 | 17.58 |
| opt | 17.31 | 1.00 |

| optimum-intel==1.20 | inference_mode (removed) | optimum.intel.pipelines (recommended) |
| --- | --- | --- |
| base | N/A | 17.58 |
| opt | N/A | 0.49 |

With optimum-intel 1.17, I can observe the oneDNN primitives being executed via `DNNL_VERBOSE`, and control the ISA (and thereby whether AMX is used) via `DNNL_MAX_CPU_ISA`, under both the base and opt setups.
With optimum-intel 1.20, I can still do so in base (native transformers), but not in opt (optimum.intel.pipelines). This indicates that optimum-intel 1.17 goes through oneDNN while optimum-intel 1.20 does not. (How these knobs are set is sketched below.)
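
For context, oneDNN reads both knobs from the environment, so they have to be in place before the library initializes. A minimal sketch of setting them from Python (they can equally be exported in the shell); the ISA value shown is just one example from oneDNN's documented set:

```python
import os

# Set oneDNN controls before torch / the model are loaded, so oneDNN picks them up at init.
os.environ["DNNL_VERBOSE"] = "1"                      # print executed oneDNN primitives
os.environ["DNNL_MAX_CPU_ISA"] = "AVX512_CORE_BF16"   # example: cap the ISA below AMX

import torch
from transformers import pipeline
```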

I learned that optimum-intel is partly based on intel-extension-for-pytorch (IPEX), and that IPEX can use libxsmm to drive AMX (see intel/intel-extension-for-pytorch#517 and intel/intel-extension-for-pytorch#720). A quick check for AMX availability on the host is sketched below.
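
This check only shows that the CPU exposes AMX; it does not show which library (oneDNN, libxsmm, ...) actually dispatches AMX kernels:

```python
# Availability check only: SPR exposes amx_tile / amx_bf16 / amx_int8 in its CPU flags.
with open("/proc/cpuinfo") as f:
    flags = next(line for line in f if line.startswith("flags")).split()

print({name: name in flags for name in ("amx_tile", "amx_bf16", "amx_int8")})
```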

My questions are:

  1. In optimum-intel 1.17, do inference_mode and optimum.intel.pipelines apply different optimizations? What's the difference?
  2. It seems that the native transformers 4.41.2 is also using AMX (with oneDNN). How does optimum-intel 1.17's optimum.intel.pipelines better utilize AMX?
  3. It seems that optimum-intel 1.20 doesn't use AMX with oneDNN. Then does it use AMX? If so, what does it use AMX with? libxsmm?
  4. How can I verify that AMX is used by optimum-intel 1.20 and the corresponding native transformer?
  5. As oneDNN is not used, how can I control whether to use AMX and observe the primitives in another library? Any substitute for DNNL_MAX_CPU_ISA?

Many many thanks!

Code as follows:

```python
# inference_mode (to be removed)
import sys

import torch
from transformers import pipeline
from optimum.intel import inference_mode

model_id = "meta-llama/Llama-2-7b-hf"  # llama-2-7b (assumed checkpoint id)

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
)

if sys.argv[1] == "base":
    result = benchmark(pipe)  # benchmark() is the timing helper (sketched below)
elif sys.argv[1] == "opt":
    # JIT-trace and optimize the pipeline's model via optimum-intel's inference_mode
    with inference_mode(pipe, dtype=torch.bfloat16, jit=True) as opt_pipe:
        result = benchmark(opt_pipe)
```

```python
# optimum.intel.pipelines (recommended)
import sys

import torch
from transformers.pipelines import pipeline as transformers_pipeline
from optimum.intel.pipelines import pipeline as ipex_pipeline

model_id = "meta-llama/Llama-2-7b-hf"  # llama-2-7b (assumed checkpoint id)

if sys.argv[1] == "base":
    pipe = transformers_pipeline("text-generation", model_id, torch_dtype=torch.bfloat16)
elif sys.argv[1] == "opt":
    # accelerator="ipex" loads the model through the IPEX-optimized model classes
    pipe = ipex_pipeline("text-generation", model_id, accelerator="ipex", torch_dtype=torch.bfloat16)

with torch.inference_mode():
    result = benchmark(pipe)
```
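
The benchmark() helper is not shown above; for completeness, a minimal sketch of the kind of timing loop assumed (prompt, warmup, iteration count, and generation length are placeholders, not the settings behind the numbers in the table):

```python
import time

def benchmark(pipe, prompt="Explain AMX in one paragraph.", warmup=2, iters=10, max_new_tokens=128):
    """Hypothetical timing helper: average wall-clock seconds per text-generation call."""
    for _ in range(warmup):  # warm-up runs (JIT tracing, weight packing, caches)
        pipe(prompt, max_new_tokens=max_new_tokens)
    start = time.perf_counter()
    for _ in range(iters):
        pipe(prompt, max_new_tokens=max_new_tokens)
    return (time.perf_counter() - start) / iters
```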
@yao-matrix

  1. In optimum-intel 1.17, do inference_mode and optimum.intel.pipelines apply different optimizations? What's the difference?
    [Response] inference_mode mainly JIT-traces the model, so it leverages compilation to fuse ops and gain performance. We felt that usage experience was not aligned with vanilla transformers code, so we changed to IPEXModel. In 1.20, IPEXModel inherits all the optimizations from inference_mode and additionally integrates IPEX custom ops to further improve performance (a usage sketch follows this list).

  2. It seems that the native transformers 4.41.2 is also using AMX (with oneDNN). How does optimum-intel 1.17's optimum.intel.pipelines better utilize AMX?
    [Response] The extra performance comes from two places: (1) graph fusion from the JIT, and (2) IPEX custom ops, which have better AMX efficiency than stock PyTorch's kernels (which is what transformers uses).

  3. It seems that optimum-intel 1.20 doesn't use AMX with oneDNN. Then does it use AMX? If so, what does it use AMX with? libxsmm?
    [Response] Yes, it still uses AMX. Whether it goes through oneDNN or libxsmm depends on IPEX's dispatch strategy; the principle is to pick whichever performs better.

  4. How can I verify that AMX is used by optimum-intel 1.20 and the corresponding native transformer?
    [Response] You can use turbostat to check the CPU's runtime frequency: when AMX is used, the frequency will be lower than when it is not. Put simply, both are using AMX; the difference is that some IPEX custom ops have better efficiency.

  5. As oneDNN is not used, how can I control whether to use AMX and observe the primitives in another library? Any substitute for DNNL_MAX_CPU_ISA?
    [Response] Maybe you can try LIBXSMM_TARGET.
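
To make the IPEXModel path from answer 1 concrete, a minimal sketch of using IPEXModelForCausalLM directly with a transformers pipeline (the checkpoint id is assumed, and torch_dtype is assumed to be forwarded as in the transformers API):

```python
import torch
from transformers import AutoTokenizer, pipeline
from optimum.intel import IPEXModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"  # assumed llama-2-7b checkpoint id

# IPEXModelForCausalLM applies the IPEX optimizations (custom ops, fused kernels) at load time;
# optimum.intel.pipelines with accelerator="ipex" appears to build on the same model class.
model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AMX on Sapphire Rapids", max_new_tokens=32)[0]["generated_text"])
```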
