failed to run Llama-2-7b-chat-hf on NPU through Sample/Python #820

Open
aoke79 opened this issue Sep 4, 2024 · 8 comments

aoke79 commented Sep 4, 2024

Hi,
I failed to run Llama-2-7b-chat-hf on the NPU; please give me a hand.

  1. I converted the model with the commands below and got two models:
    a) optimum-cli export openvino --task text-generation -m Meta--Llama-2-7b-chat-hf --weight-format int4_sym_g128 --ratio 1.0 ov--Llama-2-7b-chat-hf-int4-sym-g128
    b) optimum-cli export openvino --task text-generation -m Meta--Llama-2-7b-chat-hf --weight-format int4 ov--Llama-2-7b-chat-hf-int4
  2. I ran chat_sample, benchmark_genai, and beam_search_causal_lm, and got similar results, such as:
    a) python beam_search_causal_lm.py c:\AIGC\hf\ov--Llama-2-7b-chat-hf-int4-sym-g128 "why the Sun is yellow?"
    b) python chat_sample.py c:\AIGC\hf\ov--Llama-2-7b-chat-hf-int4-sym-g128
    c) python benchmark_genai.py -m C:\AIGC\openvino\models\ov--Llama-2-7b-chat-hf-int4-sym-g128 -p "why the Sun is yellow?" -nw 1 -n 1 -mt 200 -d CPU

(env_ov_genai) c:\AIGC\openvino\openvino.genai\samples\python\beam_search_causal_lm>python beam_search_causal_lm.py c:\AIGC\hf\ov--Llama-2-7b-chat-hf-int4-sym-g128 "why the Sun is yellow?"
Traceback (most recent call last):
  File "c:\AIGC\openvino\openvino.genai\samples\python\beam_search_causal_lm\beam_search_causal_lm.py", line 29, in <module>
    main()
  File "c:\AIGC\openvino\openvino.genai\samples\python\beam_search_causal_lm\beam_search_causal_lm.py", line 24, in main
    beams = pipe.generate(args.prompts, config)
RuntimeError: Exception from src\inference\src\cpp\infer_request.cpp:79:
Check '::getPort(port, name, {_impl->get_inputs(), _impl->get_outputs()})' failed at src\inference\src\cpp\infer_request.cpp:79:
Port for tensor name beam_idx was not found.

(env_ov_genai) c:\AIGC\openvino\openvino.genai\samples\python\benchmark_genai>python benchmark_genai.py -m c:\AIGC\openvino\models\TinyLlama-1.1B-Chat-v1.0\OV_FP16-4BIT_DEFAULT -p "why the Sun is yellow?" -nw 1 -n 1 -mt 200 -d NPU
Traceback (most recent call last):
  File "c:\AIGC\openvino\openvino.genai\samples\python\benchmark_genai\benchmark_genai.py", line 49, in <module>
    main()
  File "c:\AIGC\openvino\openvino.genai\samples\python\benchmark_genai\benchmark_genai.py", line 32, in main
    pipe.generate(prompt, config)
RuntimeError: Exception from C:\Jenkins\workspace\private-ci\ie\build-windows-vs2019\b\repos\openvino.genai\src\cpp\src\llm_pipeline_static.cpp:206:
Currently only batch size=1 is supported

(env_ov_genai) c:\AIGC\openvino\openvino.genai\samples\python>python chat_sample.py c:\AIGC\hf\ov--Llama-2-7b-chat-hf-int4-sym-g128
Traceback (most recent call last):
  File "c:\AIGC\openvino\openvino.genai\samples\python\chat_sample.py", line 43, in <module>
    main()
  File "c:\AIGC\openvino\openvino.genai\samples\python\chat_sample.py", line 22, in main
    pipe = openvino_genai.LLMPipeline(args.model_dir, device)
RuntimeError: Exception from src\core\src\pass\stateful_to_stateless.cpp:128:
Stateful models without beam_idx input are not supported in StatefulToStateless transformation

I'm not sure whether I converted the model correctly, which is why I generated the two models with the commands above, but neither of them worked.
Could you please show me how to do this?
Thanks a lot

aoke79 (Author) commented Sep 4, 2024

pip-list.txt
Attaching the pip list FYI.
Thanks

aoke79 (Author) commented Sep 9, 2024

Can anyone please take a look at this issue?
Thanks

Wovchena (Collaborator) commented Sep 9, 2024

The --task value is incorrect for optimum-cli. Try text-generation-with-past, or don't specify --task at all; see the example command below.
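
For reference, applying that to the export command from the first comment would look something like the line below (same model path and weight-format options as above; only the --task value changes, so this is just the suggested fix spelled out, not a verified recipe):

optimum-cli export openvino --task text-generation-with-past -m Meta--Llama-2-7b-chat-hf --weight-format int4_sym_g128 --ratio 1.0 ov--Llama-2-7b-chat-hf-int4-sym-g128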

aoke79 (Author) commented Sep 10, 2024

If I remove --task text-generation, it shows the error below:

optimum-cli export openvino -m Meta--Llama-2-7b-chat-hf --weight-format int4 ov--Llama-2-7b-chat-hf-int4
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Scripts\optimum-cli.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Lib\site-packages\optimum\commands\optimum_cli.py", line 208, in main
    service.run()
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Lib\site-packages\optimum\commands\export\openvino.py", line 304, in run
    task = infer_task(self.args.task, self.args.model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Lib\site-packages\optimum\exporters\openvino\__main__.py", line 54, in infer_task
    task = TasksManager.infer_task_from_model(model_name_or_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Lib\site-packages\optimum\exporters\tasks.py", line 1680, in infer_task_from_model
    task = cls._infer_task_from_model_name_or_path(model, subfolder=subfolder, revision=revision)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\envs\env_ov_optimum\Lib\site-packages\optimum\exporters\tasks.py", line 1593, in _infer_task_from_model_name_or_path
    raise RuntimeError(
RuntimeError: Cannot infer the task from a local directory yet, please specify the task manually (image-to-text, image-to-image, image-classification, audio-classification, mask-generation, feature-extraction, zero-shot-image-classification, object-detection, image-segmentation, text-to-audio, semantic-segmentation, masked-im, sentence-similarity, audio-xvector, conversational, audio-frame-classification, stable-diffusion, automatic-speech-recognition, text2text-generation, fill-mask, question-answering, multiple-choice, text-classification, text-generation, zero-shot-object-detection, token-classification, stable-diffusion-xl, depth-estimation).

aoke79 (Author) commented Sep 10, 2024

It worked with --task text-generation-with-past, as shown below:

INFO:nncf:Statistics of the bitwidth distribution:
+----------------+-----------------------------+----------------------------------------+
| Num bits (N)   | % all parameters (layers)   | % ratio-defining parameters (layers)   |
+================+=============================+========================================+
| 8              | 4% (2 / 226)                | 0% (0 / 224)                           |
+----------------+-----------------------------+----------------------------------------+
| 4              | 96% (224 / 226)             | 100% (224 / 224)                       |
+----------------+-----------------------------+----------------------------------------+
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 226/226 • 0:03:17 • 0:00:00
Set tokenizer padding side to left for text-generation-with-past task.

BTW: how can I know which parameters to use for which models?
Thanks a lot
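
As a side note on the task question: the optimum-cli traceback earlier in this thread shows the task being resolved through TasksManager.infer_task_from_model, so the same helper can be called directly to see which task optimum would pick for a given Hub model. A minimal sketch (the model ID here is only an illustration, and Hugging Face Hub access is assumed):

from optimum.exporters.tasks import TasksManager

# Ask optimum which export task it would infer for a Hub model ID.
# As the error above notes, this does not work for a plain local
# directory, which is why a local export needs --task to be set.
task = TasksManager.infer_task_from_model("meta-llama/Llama-2-7b-chat-hf")
print(task)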

aoke79 (Author) commented Sep 10, 2024

I used the newly generated model, but "benchmark_genai" still does not work with it.

python benchmark_genai.py -m C:\AIGC\hf\llama2_7b_chat_ov_int4_default_24_3 -p "why the Sun is yellow?" -nw 1 -n 1 -mt 200 -d NPU
Traceback (most recent call last):
  File "C:\AIGC\openvino\openvino.genai\samples\python\benchmark_genai\benchmark_genai.py", line 49, in <module>
    main()
  File "C:\AIGC\openvino\openvino.genai\samples\python\benchmark_genai\benchmark_genai.py", line 32, in main
    pipe.generate(prompt, config)
RuntimeError: Exception from C:\Jenkins\workspace\private-ci\ie\build-windows-vs2019\b\repos\openvino.genai\src\cpp\src\llm_pipeline_static.cpp:206:
Currently only batch size=1 is supported

Thanks,

TolyaTalamanov (Collaborator) commented

Hi @aoke79, the problem should already be fixed; please update the packages:

pip uninstall openvino openvino-tokenizers openvino-genai
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
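
After reinstalling, a quick way to confirm which wheels actually ended up in the environment is a standard-library version check, for example:

from importlib.metadata import version

# Print the installed version of each package the fix depends on.
for pkg in ("openvino", "openvino-tokenizers", "openvino-genai"):
    print(pkg, version(pkg))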

Panepo commented Dec 6, 2024

Hi @TolyaTalamanov,

I have updated the packages but got a similar problem: the program terminates without any message. Here's my code:

import openvino_genai as ov_genai

model_dir = "models/TinyLlama-1.1B-Chat-v1.0-int4-ov"
pipe = ov_genai.LLMPipeline(str(model_dir), "NPU")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 2048

message = "Good morning"
response = pipe.generate([message], config)

perf_metrics = response.perf_metrics
print(f"Load time: {perf_metrics.get_load_time():.2f} ms")
print(f"Generate time: {perf_metrics.get_generate_duration().mean:.2f} ± {perf_metrics.get_generate_duration().std:.2f} ms")
print(f"Tokenization time: {perf_metrics.get_tokenization_duration().mean:.2f} ± {perf_metrics.get_tokenization_duration().std:.2f} ms")
print(f"Detokenization time: {perf_metrics.get_detokenization_duration().mean:.2f} ± {perf_metrics.get_detokenization_duration().std:.2f} ms")
print(f"TTFT: {perf_metrics.get_ttft().mean:.2f} ± {perf_metrics.get_ttft().std:.2f} ms")
print(f"TPOT: {perf_metrics.get_tpot().mean:.2f} ± {perf_metrics.get_tpot().std:.2f} ms")
print(f"Throughput : {perf_metrics.get_throughput().mean:.2f} ± {perf_metrics.get_throughput().std:.2f} tokens/s")

The model was downloaded from OpenVINO's Hugging Face page.

My CPU is a Core Ultra 7 165U, the NPU driver version is 32.0.100.3104, and the platform is Win11 23H2.

Thanks
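
One way to narrow down a silent exit like this is to run the identical pipeline on CPU first: if CPU generation succeeds, the failure is NPU-specific (driver or static pipeline) rather than a model or package problem. A minimal sketch reusing the snippet above (same model path assumption; a small max_new_tokens keeps the check quick):

import openvino_genai as ov_genai

model_dir = "models/TinyLlama-1.1B-Chat-v1.0-int4-ov"

# Same pipeline as above, but on CPU, to separate model/package issues
# from NPU-specific ones.
pipe = ov_genai.LLMPipeline(model_dir, "CPU")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 64  # a short run is enough for a sanity check

print(pipe.generate(["Good morning"], config))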
