These examples showcase inference of text-generation Large Language Models (LLMs): `chatglm`, `LLaMA`, `Qwen` and other models with the same signature. The applications don't have many configuration options to encourage the reader to explore and modify the source code. Loading `openvino_tokenizers` to `ov::Core` enables tokenization. Run `convert_tokenizer` to generate IRs for the samples. `group_beam_searcher.hpp` implements the algorithm of the same name, which is used by `beam_search_causal_lm`. There is also a Jupyter notebook which provides an example of an LLM-powered chatbot in Python.
A common LLM inference optimization is the use of a past KV (key/value) cache. In a model as originally implemented in a DL framework (e.g. PyTorch models from Hugging Face), this cache is represented by dedicated inputs and outputs. To optimize it further and simplify usage, the model is transformed into a stateful form. This transformation improves inference performance and decreases the amount of runtime memory allocated in long-running text generation scenarios. It is achieved by hiding the model inputs and outputs that represent past KV-cache tensors and handling them inside the model in a more efficient way, although the cache remains accessible through the state API. This is in contrast to the stateless approach, which requires manipulating these inputs and outputs explicitly. An introduction to stateful models can be found in https://docs.openvino.ai/2023.3/openvino_docs_OV_UG_stateful_models_intro.html.
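For illustration, a minimal sketch of reaching that hidden cache through the state API is shown below; it assumes a stateful LLM IR at a placeholder `model.xml` path and only lists and resets the variable states:

```cpp
#include <openvino/openvino.hpp>

#include <iostream>

int main() {
    ov::Core core;
    // "model.xml" is a placeholder for a stateful LLM IR produced by optimum-intel or llm_bench.
    ov::CompiledModel compiled = core.compile_model("model.xml", "CPU");
    ov::InferRequest request = compiled.create_infer_request();

    // The past KV-cache tensors are no longer regular inputs/outputs, but they stay
    // reachable as variable states of the infer request.
    for (ov::VariableState& state : request.query_state()) {
        std::cout << state.get_name() << '\n';
        state.reset();  // drop the accumulated KV-cache, e.g. before an unrelated prompt
    }
    return 0;
}
```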
Hiding the KV cache introduces a peculiarity for the beam search algorithm. Beam search requires batched inference of multiple beams. The design described so far would generate multiple independent sequences of tokens, whereas beam search needs to remove some of the ongoing beams and split others into multiple branches. Beam removal requires deleting the corresponding KV-cache entries, and beam splitting requires copying the corresponding KV-cache values.
To make it possible to implement beam search without accessing the model's internal state, a stateful LLM converted with `optimum-intel` or `llm_bench` introduces an additional 1-dimensional `beam_idx` input. `beam_idx` must contain the indices of the batch elements that are selected to continue and will evolve during the next beam search iteration. When generation starts there is only one beam, corresponding to the initial prompt. To keep the initial beam and introduce a copy of it, `beam_idx` must hold the values `[0, 0]`. The dynamic batch size makes it possible to change the number of beams on the fly. To remove the zeroth sequence and keep only the second beam, `beam_idx` must hold the value `[1]`.
Assume there are two running beams. To continue generating both beams at the next iteration, the `beam_idx` values must be `[0, 1]`, pointing to batch elements `0` and `1`. To drop the last beam and split the other beam in two, `beam_idx` must be set to `[0, 0]`. This results in reusing only the part of the KV cache that corresponds to the zeroth element in the batch. The process of selecting the proper cache entries is called Cache Reorder.
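As a rough sketch (assuming an `ov::InferRequest` named `lm` created for the stateful model), the two scenarios above could be expressed by filling the `beam_idx` tensor before each `infer()` call; the two alternatives are shown back to back for illustration only:

```cpp
#include <openvino/openvino.hpp>

// Illustrative only: `lm` is assumed to be an infer request for the stateful model.
void reorder_cache_examples(ov::InferRequest& lm) {
    // Keep both running beams: next-iteration beams 0 and 1 reuse the KV-cache
    // of batch elements 0 and 1 respectively.
    ov::Tensor keep_both{ov::element::i32, {2}};
    keep_both.data<int32_t>()[0] = 0;
    keep_both.data<int32_t>()[1] = 1;
    lm.set_tensor("beam_idx", keep_both);

    // Alternative scenario - drop the last beam and split the other one: both
    // next-iteration beams reuse the KV-cache of batch element 0 (Cache Reorder).
    ov::Tensor split_zeroth{ov::element::i32, {2}};
    split_zeroth.data<int32_t>()[0] = 0;
    split_zeroth.data<int32_t>()[1] = 0;
    lm.set_tensor("beam_idx", split_zeroth);
}
```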
The images below represent stateless and stateful LLM pipelines. The model has 4 inputs:
- `input_ids` contains the next selected token
- `attention_mask` is filled with `1`
- `position_ids` encodes the position of the token currently being generated in the sequence
- `beam_idx` selects beams
The model has one output, `logits`, describing the predicted distribution over the next tokens, plus the internal KV-cache state.
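The expected signature can be double-checked after conversion; the sketch below assumes the IR is available at a placeholder `openvino_model.xml` path and simply prints the input and output names and shapes:

```cpp
#include <openvino/openvino.hpp>

#include <iostream>

int main() {
    ov::Core core;
    // "openvino_model.xml" is a placeholder for the converted stateful LLM IR.
    std::shared_ptr<ov::Model> model = core.read_model("openvino_model.xml");
    for (const ov::Output<ov::Node>& input : model->inputs()) {
        std::cout << "input:  " << input.get_any_name() << ' ' << input.get_partial_shape() << '\n';
    }
    for (const ov::Output<ov::Node>& output : model->outputs()) {
        std::cout << "output: " << output.get_any_name() << ' ' << output.get_partial_shape() << '\n';
    }
    // Expected: input_ids, attention_mask, position_ids, beam_idx and a single logits
    // output; the KV-cache tensors are hidden inside the stateful model.
    return 0;
}
```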
The program loads a tokenizer, a detokenizer and a model (`.xml` and `.bin`) to OpenVINO. A prompt is tokenized and passed to the model. The model greedily generates token by token until the special end-of-sequence (EOS) token is obtained. The predicted tokens are converted to chars and printed in a streaming fashion.
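The following is a condensed, illustrative sketch of that loop, not the sample's exact code: it assumes the prompt has already been fed through an `ov::InferRequest` named `lm`, that `next_token` holds the first token greedily picked from the prompt's logits, and it prints raw token ids instead of running the detokenizer:

```cpp
#include <openvino/openvino.hpp>

#include <algorithm>
#include <iostream>

// Illustrative helper: continues greedy generation after the prompt has already been
// consumed by `lm`. `prompt_len` is the number of prompt tokens and `eos_token` is the
// model-specific end-of-sequence id (both assumptions for this sketch).
void greedy_loop(ov::InferRequest& lm, int64_t next_token, size_t prompt_len, int64_t eos_token) {
    ov::Tensor input_ids{ov::element::i64, {1, 1}};
    ov::Tensor position_ids{ov::element::i64, {1, 1}};
    ov::Tensor beam_idx{ov::element::i32, {1}};
    beam_idx.data<int32_t>()[0] = 0;  // a single beam in greedy decoding
    size_t seq_len = prompt_len;      // tokens the model has already consumed

    while (next_token != eos_token) {
        input_ids.data<int64_t>()[0] = next_token;
        position_ids.data<int64_t>()[0] = int64_t(seq_len);
        // attention_mask covers all past tokens plus the current one and is filled with 1.
        ov::Tensor attention_mask{ov::element::i64, {1, ++seq_len}};
        std::fill_n(attention_mask.data<int64_t>(), seq_len, 1);

        lm.set_tensor("input_ids", input_ids);
        lm.set_tensor("attention_mask", attention_mask);
        lm.set_tensor("position_ids", position_ids);
        lm.set_tensor("beam_idx", beam_idx);
        lm.infer();

        // logits shape is [1, 1, vocab_size]; greedily pick the most probable token.
        ov::Tensor logits = lm.get_tensor("logits");
        const float* scores = logits.data<float>();
        size_t vocab_size = logits.get_shape().back();
        next_token = std::max_element(scores, scores + vocab_size) - scores;

        std::cout << next_token << ' ' << std::flush;  // the real sample streams detokenized text
    }
}
```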
The program loads a tokenizer, a detokenizer and a model (`.xml` and `.bin`) to OpenVINO. A prompt is tokenized and passed to the model. The model predicts a distribution over the next tokens, and group beam search samples from that distribution to explore possible sequences. The result is converted to chars and printed.
Speculative decoding (or assisted generation in Hugging Face terminology) is a recent technique that speeds up token generation by using an additional, smaller draft model alongside the main model.
Speculative decoding works as follows. The draft model predicts the next K tokens one by one in an autoregressive manner, while the main model validates these predictions and corrects them if necessary. We go through each predicted token, and if a difference between the draft and the main model is detected, we stop and keep the last token predicted by the main model. Then the draft model gets the latest main prediction and again tries to predict the next K tokens, repeating the cycle.
This approach reduces the number of infer requests to the main model, enhancing performance. For instance, in more predictable parts of text generation, the draft model can, in the best case, generate the next K tokens that exactly match what the main model would produce. In that case they are validated in a single inference request to the main model (which is bigger, more accurate, but slower) instead of running K subsequent requests. More details can be found in the original papers: https://arxiv.org/pdf/2211.17192.pdf, https://arxiv.org/pdf/2302.01318.pdf.
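A schematic sketch of one speculation cycle is shown below; `draft_next_token` and `main_greedy_tokens` are hypothetical stand-ins for running the draft and main infer requests, so this illustrates the idea rather than the sample's actual implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for running the draft and main infer requests:
int64_t draft_next_token(const std::vector<int64_t>& context);              // one autoregressive draft step
std::vector<int64_t> main_greedy_tokens(const std::vector<int64_t>& context,
                                        const std::vector<int64_t>& draft); // main model scores the whole draft in one
                                                                             // pass and returns its greedy pick for every
                                                                             // drafted position plus one extra token

// One speculation cycle: returns the context extended with all accepted tokens.
std::vector<int64_t> speculative_step(std::vector<int64_t> context, size_t K) {
    // 1. The draft model proposes K tokens autoregressively.
    std::vector<int64_t> draft;
    std::vector<int64_t> draft_context = context;
    for (size_t i = 0; i < K; ++i) {
        int64_t token = draft_next_token(draft_context);
        draft.push_back(token);
        draft_context.push_back(token);
    }
    // 2. The main model validates the whole drafted block in a single infer request.
    std::vector<int64_t> main_choice = main_greedy_tokens(context, draft);
    // 3. Accept drafted tokens while they match the main model; on the first mismatch
    //    keep the main model's token and discard the rest of the draft.
    for (size_t i = 0; i < draft.size(); ++i) {
        context.push_back(main_choice[i]);
        if (main_choice[i] != draft[i]) {
            return context;
        }
    }
    // All K drafted tokens matched: the main model's extra prediction comes for free.
    context.push_back(main_choice[draft.size()]);
    return context;
}
```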
Note: Models should belong to the same family and have the same tokenizers.
Install OpenVINO Archives >= 2023.3. `<INSTALL_DIR>` below refers to the extraction location.
git submodule update --init
source <INSTALL_DIR>/setupvars.sh
cmake -DCMAKE_BUILD_TYPE=Release -S ./ -B ./build/ && cmake --build ./build/ -j
git submodule update --init
<INSTALL_DIR>\setupvars.bat
cmake -S .\ -B .\build\ && cmake --build .\build\ --config Release -j
The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.
source <INSTALL_DIR>/setupvars.sh
python3 -m pip install --upgrade-strategy eager "transformers<4.38" -r ../../../llm_bench/python/requirements.txt ../../../thirdparty/openvino_tokenizers/[transformers] --extra-index-url https://download.pytorch.org/whl/cpu
python3 ../../../llm_bench/python/convert.py --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --output_dir ./TinyLlama-1.1B-Chat-v1.0/ --precision FP16
convert_tokenizer ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ --output ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ --with-detokenizer --trust-remote-code
<INSTALL_DIR>\setupvars.bat
python -m pip install --upgrade-strategy eager "transformers<4.38" -r ..\..\..\llm_bench\python\requirements.txt ..\..\..\thirdparty\openvino_tokenizers\[transformers] --extra-index-url https://download.pytorch.org/whl/cpu
python ..\..\..\llm_bench\python\convert.py --model_id TinyLlama/TinyLlama-1.1B-Chat-v1.0 --output_dir .\TinyLlama-1.1B-Chat-v1.0\ --precision FP16
convert_tokenizer .\TinyLlama-1.1B-Chat-v1.0\pytorch\dldt\FP16\ --output .\TinyLlama-1.1B-Chat-v1.0\pytorch\dldt\FP16\ --with-detokenizer --trust-remote-code
Usage:
greedy_causal_lm <MODEL_DIR> "<PROMPT>"
beam_search_causal_lm <MODEL_DIR> "<PROMPT>"
speculative_decoding_lm <DRAFT_MODEL_DIR> <MAIN_MODEL_DIR> "<PROMPT>"
Examples:
./build/greedy_causal_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ "Why is the Sun yellow?"
./build/beam_search_causal_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ "Why is the Sun yellow?"
./build/speculative_decoding_lm ./TinyLlama-1.1B-Chat-v1.0/pytorch/dldt/FP16/ ./Llama-2-7b-chat-hf/pytorch/dldt/FP16/ "Why is the Sun yellow?"
To enable Unicode characters in the Windows cmd console, open `Region` settings from `Control panel`. `Administrative` -> `Change system locale` -> `Beta: Use Unicode UTF-8 for worldwide language support` -> `OK`. Reboot.
- chatglm
  - https://huggingface.co/THUDM/chatglm2-6b - refer to chatglm2-6b - AttributeError: can't set attribute in case of `AttributeError`
  - https://huggingface.co/THUDM/chatglm3-6b
- LLaMA 2
  - https://huggingface.co/meta-llama/Llama-2-13b-chat-hf
  - https://huggingface.co/meta-llama/Llama-2-13b-hf
  - https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
  - https://huggingface.co/meta-llama/Llama-2-7b-hf
  - https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
  - https://huggingface.co/meta-llama/Llama-2-70b-hf
- Llama2-7b-WhoIsHarryPotter
- OpenLLaMA
- TinyLlama
- Qwen
  - https://huggingface.co/Qwen/Qwen-7B-Chat
  - https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 - refer to Qwen-7B-Chat-Int4 - Torch not compiled with CUDA enabled in case of `AssertionError`
- Dolly
- Phi
- notus-7b-v1
- zephyr-7b-beta
This pipeline can work with other similar topologies produced by `optimum-intel` with the same model signature.