
Model patcher #567

Merged — 41 commits merged into huggingface:main from model_patcher on Mar 8, 2024

Conversation

@jiqing-feng (Collaborator) commented Feb 19, 2024

This PR enables the ipex llama model by patching functions and classes, and it delivers a 30% speed-up over the original optimization.
The ipex optimization ops will be released soon, and I will add the CI tests once they are released. We can focus on the integration for now.
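For readers unfamiliar with the approach, here is a minimal, hypothetical sketch of what the function patching looks like. The names below are illustrative stand-ins, not the PR's actual code; the real replacements live in optimum/exporters/ipex/llama_functions.py.

```python
# Illustrative sketch only -- `ipex_llama_attn_forward` is a stand-in for
# the optimized forward defined in llama_functions.py.
from transformers.models.llama.modeling_llama import LlamaAttention

def ipex_llama_attn_forward(self, *args, **kwargs):
    # The real replacement routes attention through fused ipex kernels
    # (the commit log mentions ipex_rope and ipex_scale_dot_product);
    # here we simply delegate to keep the sketch runnable.
    return LlamaAttention.forward(self, *args, **kwargs)

def patch_model(model):
    # Rebind `forward` on every attention instance so later jit tracing
    # records the optimized implementation instead of the stock one.
    for module in model.modules():
        if isinstance(module, LlamaAttention):
            module.forward = ipex_llama_attn_forward.__get__(module, LlamaAttention)
    return model
```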

BTW, this PR includes #566, and I will rebase it after #566 is merged.

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

[6 review threads, all resolved: optimum/exporters/ipex/llama_functions.py (4), optimum/intel/ipex/modeling_base.py, optimum/exporters/ipex/model_patcher.py]
@jiqing-feng mentioned this pull request on Feb 27, 2024
[10 review threads, all resolved: optimum/exporters/ipex/llama_functions.py (2), optimum/intel/ipex/modeling_base.py (5), optimum/exporters/ipex/model_patcher.py (2), optimum/exporters/ipex/__init__.py]
@jiqing-feng (Collaborator, Author) commented Feb 28, 2024

Hi @echarlaix, thanks for your review. I have addressed most of the comments but still have a few open issues with the comments and tests. Would you please take another review pass? Thanks!

@echarlaix (Collaborator) left a comment

Thanks a lot for iterating on the PR @jiqing-feng

[10 review threads, all resolved: optimum/exporters/ipex/llama_functions.py (2), optimum/exporters/ipex/model_patcher.py, optimum/intel/ipex/modeling_base.py (7)]
@jiqing-feng (Collaborator, Author) commented

Hi @echarlaix, I think I have addressed all your comments, and I also added tests for ipex model generation with multiple inputs. Would you please review it again? Thanks!

[4 review threads, all resolved: optimum/intel/ipex/modeling_base.py (2), optimum/exporters/ipex/model_patcher.py, optimum/exporters/ipex/modeling_utils.py]
@jiqing-feng (Collaborator, Author) commented

Hi @echarlaix, I have addressed all your comments. Do you mind taking one last review pass? I think we could merge it now, since it makes no change for the current ipex version, and I will let you know once the new ipex version is released. Thanks!

[6 review threads, all resolved: tests/ipex/test_modeling.py (4), optimum/intel/ipex/modeling_base.py, optimum/exporters/ipex/modeling_utils.py]
@echarlaix (Collaborator) commented

also cc @ofirzaf for visibility

@jiqing-feng (Collaborator, Author) commented

Hi @echarlaix, thanks for your detailed review!

I have gated everything behind the next ipex release; you can check the code and see that the PR makes no change at all if the ipex version is <= 2.3.0.

Although the PR works well with our internal ipex version (2.3.0.dev), I will double-check it and make it compatible when the public ipex 2.3.0 is released.

I think it is ready to merge; I would like to hear your opinion, @ofirzaf. Thanks!

```diff
@@ -128,7 +131,7 @@ def test_compare_to_transformers(self, model_arch):
         outputs = ipex_model(**tokens)
         # Compare tensor outputs
         for output_name in {"logits", "last_hidden_state"}:
-            if output_name in transformers_outputs:
+            if output_name in transformers_outputs and output_name in outputs:
```
Reviewer (Collaborator):

Suggested change:

```diff
-            if output_name in transformers_outputs and output_name in outputs:
+            if output_name in transformers_outputs:
```

We should have the same outputs for both models, so we need to keep this as is.

@jiqing-feng (Collaborator, Author) replied on Mar 7, 2024:

Actually, IPEXModel doesn't return last_hidden_state when return_dict=False (see here), and return_dict=False is needed for the jit trace.
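As a hedged illustration of why return_dict=False matters (a sketch under assumptions, not the PR's code): torch.jit.trace handles tensors and tuples of tensors as outputs, not the dict-like ModelOutput that transformers returns by default, so a traced model yields positional tuples with no last_hidden_state key to look up.

```python
# Minimal sketch: tracing wants plain tuple outputs, hence return_dict=False.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "hf-internal-testing/tiny-random-bert",  # any small model works here
    return_dict=False,  # forward now returns a plain tuple of tensors
)
model.eval()

example_input = (torch.ones((1, 8), dtype=torch.long),)  # input_ids
with torch.no_grad():
    traced = torch.jit.trace(model, example_input)
# traced(...) returns a tuple; outputs are positional, so there is no
# "last_hidden_state" name available after tracing.
```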



```python
def ipex_jit_trace(model, task, use_cache):
    if version.parse(ipex.__version__) <= version.parse("2.3.0") or not is_model_support_ipex_export(model, task):
```
@jiqing-feng (Collaborator, Author) replied:

Replying to the comment: I check the ipex version here, so if ipex.__version__ <= 2.3.0, the exporters.ipex functions will not be used.
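A short self-contained sketch of that gate (the function name ipex_patching_available is illustrative, not from the PR):

```python
from packaging import version
import intel_extension_for_pytorch as ipex

def ipex_patching_available() -> bool:
    # The exporters.ipex patching is skipped entirely on released ipex
    # builds (<= 2.3.0), so the PR is a no-op until the new ops ship.
    return version.parse(ipex.__version__) > version.parse("2.3.0")
```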

[9 review threads, all resolved: optimum/exporters/ipex/model_patcher.py, optimum/exporters/ipex/llama_functions.py, optimum/intel/ipex/modeling_base.py (2), tests/ipex/test_modeling.py (3), optimum/exporters/ipex/modeling_utils.py (2)]
@ofirzaf (Contributor) commented Mar 7, 2024

Isn't all this model patching supposed to be taken care of by ipex.llm.optimize?

@jiqing-feng (Collaborator, Author) commented Mar 8, 2024

> Isn't all this model patching supposed to be taken care of by ipex.llm.optimize?

Hi @ofirzaf. ipex's purpose here is to supply basic ops for the model, just like pytorch. ipex.llm.optimize is only for our internal tests and is not compatible with the current transformers version.

@jiqing-feng (Collaborator, Author) commented

Hi @echarlaix. I have applied all your suggested changes, thanks for those. The failing CI was caused by #589, which I think we can fix soon. This PR should be ready to merge :)

@echarlaix merged commit 6e8cd3d into huggingface:main on Mar 8, 2024
6 of 10 checks passed
@jiqing-feng deleted the model_patcher branch on March 11, 2024 07:01
PenghuiCheng pushed a commit to PenghuiCheng/optimum-intel that referenced this pull request Mar 13, 2024
* llama model patcher

* fix jit model

* fix jit model

* rm autocast in model

* add llama model patcher

* support assisted decoding and add reorder cache function

* add comment for _prepare_past_key_values

* rebase main

* fix model_dtype

* rm useless comments

* fix llama

* add comments for ipex_rope and ipex_scale_dot_product

* fix comments

* add enable_tpp comments

* fix import

* fix review round 2

* add torch.no_grad to avoid auto_kernel_selection issue

* use torch.no_grad in jit trace

* fix ipex model testing

* add tests for ipex model generation with multi inputs

* fix code style

* remove __get__(self) as _reorder_cache is static method for the class

* fix reorder_cache

* use model_type

* check if reorder_cache is a static method

* fix _reorder_cache

* fix raise import error

* test ipex patching

* fix comments

* update API name and testing

* disable until ipex version 2.5.0

* update testing name

* Update optimum/intel/ipex/modeling_base.py

Co-authored-by: Ella Charlaix <[email protected]>

* Update tests/ipex/test_modeling.py

Co-authored-by: Ella Charlaix <[email protected]>

* fix tests

---------

Co-authored-by: Ella Charlaix <[email protected]>
echarlaix added a commit that referenced this pull request Mar 27, 2024
…tension-for-transformers. (#455)

* Support weight-only quantization with quantized operators in intel-extension-for-transformers

* Update code style

* Update readme for weight-only quantization example

* Update code

* Adapt intel-extension-for-transformers 1.3 API change

Signed-off-by: Cheng, Penghui <[email protected]>

* Support weight-only quantization with quantized operators in intel-extension-for-transformers

* Update code

* rebase code on main branch

Signed-off-by: Cheng, Penghui <[email protected]>

* Update example

Signed-off-by: Cheng, Penghui <[email protected]>

* Update optimum/intel/neural_compressor/quantization.py

Co-authored-by: Ella Charlaix <[email protected]>

* [OV]: Fixed inference after 4 bit weight compression (#569)

* [OV]: Fixed inference after 4 bit weight compression

* Fixed issue

* Update optimum/intel/openvino/modeling_decoder.py

Co-authored-by: Ella Charlaix <[email protected]>

* Applied comments

* Fixed issue when request is None

---------

Co-authored-by: Ella Charlaix <[email protected]>

* Updated docs with load_in_4bit (#558)

* Updated docs with load_in_4bit

* Update documentation

* Update documentation

* typo

---------

Co-authored-by: Ella Charlaix <[email protected]>

* Update Transformers dependency requirements (#571)

* Fix compatibility for latest transformers release (#570)

* fix compatibility for latest transformers release

* update setup

* update setup

* fix test input size

* fix prepare generation for llama models

* Deprecate compression options (#565)

* deprecate compression options

* style

* fix configuration

* Update CLI argument

* update documentation

* deprecate torch nn modules for ov quantizer

* fix ov config for fp32 models

* fix format

* update documentation

* Add check for configuration

* fix ratio default value for SD models

* add quantization_config argument for OVModel

* remove commented line

* Update docs/source/inference.mdx

Co-authored-by: Alexander Kozlov <[email protected]>

* add default config for causal LM

* fix warning message

---------

Co-authored-by: Alexander Kozlov <[email protected]>

* Add default quantization int4 config for Mixtral-8x7B (#576)

* Update  stable diffusion example requirements (#579)

* Fix collecting duplicate tensors in quantization calibration dataset (#577)

* Added deepcopying of inputs collected by InferRequestWrapper. Added a test covering the fixed issue.

* Phrasing tweaks

* Add soundfile to test requirements

* Added librosa to test requirements

* Added copying to other data cache appends

* Remove the need for real test data

* Process __call__ call properly

* Addressed suggested changes

* Save an openvino config summarizing all information related to quantization when saving model (#578)

* fix doc

* remove default compression value

* set default compression config when not provided

* save openvino config to include quantization configuration

* fix style

* add test

* update setup

* style

* remove from quantization_config key from ov_config

* add test

* update setup

* modify method name

* Fix warning (#582)

* Fix warning

* fix message warning

* Add reference to the temporary directory for windows fix (#581)

* Fix documentation (#583)

* Fix documentation

* fix

* Add llama test model to cover MQA (#585)

* change llama test model to cover MQA

* keep llama and llama2 in tests

* fix code style

* Include nncf in openvino extra (#586)

* Fix title documentation (#588)

* Update OpenVINO documentation links in README.md (#587)

* Update OpenVINO documentation links in README.md

The links are now aligned with OpenVINO 2024.0 documentation, and include permalinks instead of direct links, when possible.

* Update inference.mdx

* Update index.mdx

* Update installation.mdx

* Update README.md

* Fix default int8 quantization for CLI (#592)

* Change model output parameter to last_hidden_states for IPEXModel (#589)

* change model output parameter to last_hidden_states

* update ipex model testing

* update testing

* add output name to ipex model

* Add IPEX model patcher (#567)

* llama model patcher

* fix jit model

* fix jit model

* rm autocast in model

* add llama model patcher

* support assisted decoding and add reorder cache function

* add comment for _prepare_past_key_values

* rebase main

* fix model_dtype

* rm useless comments

* fix llama

* add comments for ipex_rope and ipex_scale_dot_product

* fix comments

* add enable_tpp comments

* fix import

* fix review round 2

* add torch.no_grad to avoid auto_kernel_selection issue

* use torch.no_grad in jit trace

* fix ipex model testing

* add tests for ipex model generation with multi inputs

* fix code style

* remove __get__(self) as _reorder_cache is static method for the class

* fix reorder_cache

* use model_type

* check if reorder_cache is a static method

* fix _reorder_cache

* fix raise import error

* test ipex patching

* fix comments

* update API name and testing

* disable until ipex version 2.5.0

* update testing name

* Update optimum/intel/ipex/modeling_base.py

Co-authored-by: Ella Charlaix <[email protected]>

* Update tests/ipex/test_modeling.py

Co-authored-by: Ella Charlaix <[email protected]>

* fix tests

---------

Co-authored-by: Ella Charlaix <[email protected]>

* Updates weight quantization section in the docs (#593)

* Remove accelerate and onnxruntime from required dependencies (#590)

* Remove accelerate dependency

* Add accelerate to import backend mapping

* Add eval method to OVModels

* add onnxruntime install for OV test

* fix test expected int8

* Fix OpenVINO image classification examples (#598)

* Fix weights compression for OPenVINO models (#596)

* hot fix for weights compression

* rewrite mock tests

* Fix default ov config (#600)

* Add warning for transformers>=4.38 and OpenVINO 2024.0 (#599)

* Add warning for transformers>=4.38 and OpenVINO 2024.0

* Use is_openvino_version to compare versions

* Show version warning only for llama and gpt-bigcode

* Fix style, show OpenVINO version

* Include affected model types in warning message

* Add hybrid quantization for StableDiffusion pipelines (#584)

* Add hybrid quantization for StableDiffusion pipelines

* apply black

* fix tests

* fix ruff

* fix lcm bug

* apply review comments

* rework dataset processing

* Add doc

* remove SDXL test

* Apply comments

* reformat

* Show device name in _print_compiled_model_properties (#541)

* Show device name in _print_compiled_model_properties

Enable CACHE_DIR also for devices like "GPU:0"

* Update optimum/intel/openvino/modeling_seq2seq.py

Co-authored-by: Ella Charlaix <[email protected]>

* Change check for gpu device

---------

Co-authored-by: Ella Charlaix <[email protected]>

* Update code with comments

Signed-off-by: Cheng, Penghui <[email protected]>

* Fixed pylint error

Signed-off-by: Cheng, Penghui <[email protected]>

* Update optimum/intel/neural_compressor/configuration.py

Co-authored-by: Ella Charlaix <[email protected]>

* Fixed example and UT for weight-only quantization

Signed-off-by: Cheng, Penghui <[email protected]>

* Fixed pre-ci test error

Signed-off-by: Cheng, Penghui <[email protected]>

* Fixed pre-ci test error

Signed-off-by: Cheng, Penghui <[email protected]>

* Fixed UT and examples error

Signed-off-by: Cheng, Penghui <[email protected]>

* Fixed pre-CI error

Signed-off-by: Cheng, Penghui <[email protected]>

* Fixed UT error

Signed-off-by: Cheng, Penghui <[email protected]>

* Update tests/openvino/test_modeling_basic.py

Co-authored-by: Ella Charlaix <[email protected]>

* Update examples/neural_compressor/language-modeling/README.md

Co-authored-by: Ella Charlaix <[email protected]>

* Update examples/neural_compressor/language-modeling/run_clm.py

Co-authored-by: Ella Charlaix <[email protected]>

* Update examples/neural_compressor/language-modeling/run_clm.py

Co-authored-by: Ella Charlaix <[email protected]>

* Update examples/neural_compressor/language-modeling/run_clm.py

Co-authored-by: Ella Charlaix <[email protected]>

* Update examples/neural_compressor/language-modeling/run_clm.py

Co-authored-by: Ella Charlaix <[email protected]>

* Update examples/neural_compressor/language-modeling/run_clm.py

Co-authored-by: Ella Charlaix <[email protected]>

* Load weight-only quantized model with INCModelForCausalLM

Signed-off-by: Cheng, Penghui <[email protected]>

* Changed parameters name for GPTQ in example

Signed-off-by: Cheng, Penghui <[email protected]>

* Changed parameters order in INCQuantizer.quantize

Signed-off-by: Cheng, Penghui <[email protected]>

* Fixed UT error

Signed-off-by: Cheng, Penghui <[email protected]>

* Update examples/neural_compressor/text-generation/run_generation.py

Co-authored-by: Ella Charlaix <[email protected]>

* Update optimum/intel/neural_compressor/quantization.py

Co-authored-by: Ella Charlaix <[email protected]>

* Update optimum/intel/neural_compressor/quantization.py

Co-authored-by: Ella Charlaix <[email protected]>

* Update import message

Signed-off-by: Cheng, Penghui <[email protected]>

* Limit intel-extension-for-transformers version

Signed-off-by: Cheng, Penghui <[email protected]>

* Limit torch version for weight-only quantization

Signed-off-by: Cheng, Penghui <[email protected]>

* Fixed doc building error

Signed-off-by: Cheng, Penghui <[email protected]>

---------

Signed-off-by: Cheng, Penghui <[email protected]>
Co-authored-by: Ella Charlaix <[email protected]>
Co-authored-by: Alexander Kozlov <[email protected]>
Co-authored-by: Ella Charlaix <[email protected]>
Co-authored-by: Lyalyushkin Nikolay <[email protected]>
Co-authored-by: Helena Kloosterman <[email protected]>
Co-authored-by: Nikita Savelyev <[email protected]>
Co-authored-by: jiqing-feng <[email protected]>
Co-authored-by: Ekaterina Aidova <[email protected]>
Co-authored-by: Karol Blaszczak <[email protected]>
Co-authored-by: Liubov Talamanova <[email protected]>