I am using AMX on an SPR (Sapphire Rapids) CPU to test LLM inference performance, and I have questions about the performance of different `optimum.intel` interfaces.
I am testing both `optimum-intel==1.17.0` and `optimum-intel==1.20.0`, with `intel_extension_for_pytorch==2.4.0`, `torch==2.4.1`, and `transformers==4.41.2`.
I am benchmarking llama-2-7b inference with BF16 precision.
As in #942, I was advised to use `from optimum.intel.pipelines import pipeline` instead of `from optimum.intel import inference_mode`, so I compared their performance in the environment above. Results (avg. inference time) are below; `base` means without optimum-intel and `opt` means with it.
optimum-intel==1.17:

|      | `inference_mode` (to be removed) | `optimum.intel.pipelines` (recommended) |
|------|----------------------------------|-----------------------------------------|
| base | 17.52                            | 17.58                                   |
| opt  | 17.31                            | 1.00                                    |

optimum-intel==1.20:

|      | `inference_mode` (removed) | `optimum.intel.pipelines` (recommended) |
|------|----------------------------|-----------------------------------------|
| base | N/A                        | 17.58                                   |
| opt  | N/A                        | 0.49                                    |
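For reference, here is roughly how I invoke the recommended interface; the model id, dtype handling, and timing loop below are placeholders rather than my exact benchmark script:

```python
import time

import torch
from optimum.intel.pipelines import pipeline

# Recommended interface: an IPEX-accelerated transformers pipeline.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-hf",  # assumed model id
    accelerator="ipex",                # select the IPEX backend
    torch_dtype=torch.bfloat16,        # assumed to be forwarded to model loading
)

prompt = "Explain AMX in one sentence."
pipe(prompt, max_new_tokens=32)  # warm-up run

start = time.time()
pipe(prompt, max_new_tokens=32)
print(f"inference time: {time.time() - start:.2f}s")
```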
Using optimum-intel 1.17, I can observe the oneDNN primitives being executed via `DNNL_VERBOSE`, and control the ISA (i.e. whether AMX is used) via `DNNL_MAX_CPU_ISA`, under both the base and opt setups.
Using optimum-intel 1.20, I can still do so in base (native transformers), but not in opt (`optimum.intel.pipelines`). This indicates that optimum-intel 1.17 uses oneDNN while optimum-intel 1.20 does not.
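This is how I set those oneDNN controls; the tiny bf16 workload at the end is only an assumed trigger to produce verbose output (in my tests I run the full model instead):

```python
import os

# Both variables must be set before oneDNN is initialized, i.e. before
# importing torch and running any workload in this process.
os.environ["DNNL_VERBOSE"] = "1"                     # print executed oneDNN primitives
os.environ["DNNL_MAX_CPU_ISA"] = "AVX512_CORE_BF16"  # cap the ISA below AMX
# os.environ["DNNL_MAX_CPU_ISA"] = "AVX512_CORE_AMX" # allow AMX again

import torch  # noqa: E402  (import only after setting the env vars)

# A small bf16 linear layer that typically dispatches to oneDNN on x86,
# so the verbose log shows which ISA the primitives were generated for.
layer = torch.nn.Linear(1024, 1024).to(torch.bfloat16)
x = torch.randn(8, 1024, dtype=torch.bfloat16)
with torch.inference_mode():
    layer(x)
```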
I learned that optimum-intel is partly based on intel-extension-for-pytorch (IPEX), and that IPEX uses libxsmm to use AMX (see intel/intel-extension-for-pytorch#517 and intel/intel-extension-for-pytorch#720). My questions are:

1. In optimum-intel 1.17, do `inference_mode` and `optimum.intel.pipelines` apply different optimizations? What is the difference?
2. It seems that native transformers 4.41.2 also uses AMX (with oneDNN). How does optimum-intel 1.17's `optimum.intel.pipelines` make better use of AMX?
3. It seems that optimum-intel 1.20 does not use AMX with oneDNN. Does it use AMX at all? If so, through which library? libxsmm?
4. How can I verify that AMX is used by optimum-intel 1.20 and by the corresponding native transformers run?
5. Since oneDNN is not used, how can I control whether AMX is used and observe the primitives in the other library? Is there a substitute for `DNNL_MAX_CPU_ISA`?

Many thanks!
> In optimum-intel 1.17, do `inference_mode` and `optimum.intel.pipelines` apply different optimizations? What is the difference?

[Response] `inference_mode` mainly applies JIT tracing to the model, leveraging compilation to fuse ops and gain performance. We felt that this usage experience was not aligned with vanilla transformers code, so we moved to `IPEXModel`. In 1.20, `IPEXModel` inherits all the optimizations from `inference_mode` and integrates IPEX custom ops to further improve performance.
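For illustration, a minimal sketch of the `IPEXModel`-based interface that backs `optimum.intel.pipelines` in 1.20; the model id and generation settings below are assumptions:

```python
import torch
from transformers import AutoTokenizer
from optimum.intel import IPEXModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# IPEXModel wraps the transformers model and applies the IPEX optimizations
# that replaced the old `inference_mode` context manager.
model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```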
> It seems that native transformers 4.41.2 also uses AMX (with oneDNN). How does optimum-intel 1.17's `optimum.intel.pipelines` make better use of AMX?

[Response] The extra performance comes from two places: 1) graph fusion from the JIT, and 2) IPEX custom ops, which have better AMX efficiency than stock PyTorch's (which is what transformers uses).
> It seems that optimum-intel 1.20 does not use AMX with oneDNN. Does it use AMX at all? If so, through which library? libxsmm?

[Response] Yes, it uses AMX. Whether oneDNN or libxsmm is used depends on IPEX's dispatching strategy; the principle is to pick whichever performs better.
> How can I verify that AMX is used by optimum-intel 1.20 and by the corresponding native transformers run?

[Response] You can use turbostat to check the CPU's runtime frequency: when AMX is in use, the frequency will be lower than when it is not. Put simply, both are using AMX; the difference is that some IPEX custom ops have better efficiency.
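As a complement to the turbostat check, one can at least confirm that the hardware and kernel expose AMX; a minimal Linux-only sketch (this checks availability, not actual use):

```python
def cpu_exposes_amx() -> bool:
    """Check /proc/cpuinfo for the AMX feature flags (Linux only)."""
    with open("/proc/cpuinfo") as f:
        flags = f.read()
    return all(flag in flags for flag in ("amx_tile", "amx_bf16"))

if __name__ == "__main__":
    # True means the hardware/kernel expose AMX; whether a given library
    # actually uses it still has to be inferred, e.g. from turbostat
    # frequency drops or the libraries' own verbose logs.
    print("AMX exposed:", cpu_exposes_amx())
```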
> Since oneDNN is not used, how can I control whether AMX is used and observe the primitives in the other library? Is there a substitute for `DNNL_MAX_CPU_ISA`?

[Response] Maybe you can try `LIBXSMM_TARGET`.
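A possible libxsmm counterpart to the oneDNN controls, sketched below; the exact target names (`clx`, `spr`) are assumptions and should be checked against the libxsmm version bundled with IPEX:

```python
import os

# Set before the first libxsmm kernel is JIT-compiled in this process.
os.environ["LIBXSMM_VERBOSE"] = "2"   # print a summary of dispatched kernels at exit
os.environ["LIBXSMM_TARGET"] = "clx"  # assumed: pin codegen to a pre-AMX target (Cascade Lake)
# os.environ["LIBXSMM_TARGET"] = "spr"  # assumed: Sapphire Rapids target with AMX

# ...then import torch / IPEX and run the model as usual in the same process.
```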