
Generate pipeline (#334) #480

Merged

Conversation

@Wovchena Wovchena (Collaborator) commented Jun 7, 2024

LLMs return logits with probabilities for each token; these probabilities can be converted into tokens/words with different techniques: greedy decoding, beam search decoding, random sampling, etc.

This requires writing user-unfriendly post-processing code even for the simplest scenario, greedy decoding. To make life easier, we combined all decoding scenarios into a single function call, where the decoding method and its parameters are specified by arguments.

In this PR we provide a user-friendly API for text generation inspired by the `generate` method from the HuggingFace Transformers library.

- [x] enable calling tokenizers/detokenizers from LLMPipeline
- [ ] add a callback for streaming mode (partially done, needs improvement)
- [x] rewrite the samples with the current approach: [causal_lm/cpp/generate_pipeline/generate_sample.cpp#L73-L83](https://github.com/pavel-esir/openvino.genai/blob/generate_pipeline/text_generation/causal_lm/cpp/generate_pipeline/generate_sample.cpp#L73-L83)
- [x] Multibatch greedy decoding (see the sketch after this list)
- [ ] Speculative decoding
- [ ] Grouped Beam Search decoding: ready for batch size 1; multibatch support needs a rebase after merging openvinotoolkit#349
- [x] Random sampling
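
A hypothetical sketch of the multibatch greedy decoding referenced in the checklist above. The vector-of-prompts overload and the returned container are illustrative assumptions, not an API confirmed by this PR:
```
LLMPipeline pipe(model_path, device);
GenerationConfig config = pipe.generation_config();

// Hypothetical batched call: assumes the call operator accepts a vector of
// prompts and returns one greedy-decoded completion per prompt.
std::vector<std::string> prompts = {"The Sun is yellow because", "Alan Turing was a"};
for (const auto& completion : pipe(prompts, config.max_new_tokens(20)))
    std::cout << completion << '\n';
```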

Example 1: Greedy search generation
```
LLMPipeline pipe(model_path, device);

// Tries to load the config from generation_config.json;
// if it is not found, default values for greedy search are used.
GenerationConfig config = pipe.generation_config();

std::cout << pipe(prompt, config.max_new_tokens(20));
```
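
The same fluent GenerationConfig can, in principle, select the other decoding modes mentioned above. A hedged sketch: the setter names num_beams, do_sample, top_p, and temperature are assumptions modeled on the HuggingFace generate API, not names confirmed by this PR:
```
LLMPipeline pipe(model_path, device);
GenerationConfig config = pipe.generation_config();

// Hypothetical: beam search decoding, assuming a num_beams(...) setter.
std::cout << pipe(prompt, config.max_new_tokens(20).num_beams(4));

// Hypothetical: random sampling, assuming do_sample/top_p/temperature setters.
std::cout << pipe(prompt, config.max_new_tokens(20).do_sample(true).top_p(0.9f).temperature(0.7f));
```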

Example 2: TextStreaming mode
```
LLMPipeline pipe(model_path, device);

GenerationConfig config = pipe.generation_config();

// TextStreamer prints decoded text as tokens arrive.
auto text_streamer = TextStreamer{pipe};
// The callback receives the token ids generated on each step; here the first
// sequence's token is forwarded to the streamer.
auto text_streamer_callback = [&text_streamer](std::vector<int64_t>&& tokens, LLMPipeline& pipe){
    text_streamer.put(tokens[0]);
};

pipe(prompt, config.max_new_tokens(20).set_callback(text_streamer_callback));
text_streamer.end();  // flush any remaining buffered text
```
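
A minimal sketch of what a streamer along the lines of TextStreamer could do internally. It assumes (hypothetically) that LLMPipeline exposes a detokenize helper taking a std::vector<int64_t> and returning std::string; the names here are illustrative, not the PR's actual implementation:
```
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical streamer: re-detokenizes the full sequence on each step and
// prints only the new suffix, because decoding token-by-token can split
// characters that span several tokens.
class SimpleTextStreamer {
    LLMPipeline& m_pipe;             // library type; provides detokenization
    std::vector<int64_t> m_tokens;   // all token ids received so far
    size_t m_printed_len = 0;        // number of characters already printed

public:
    explicit SimpleTextStreamer(LLMPipeline& pipe) : m_pipe(pipe) {}

    void put(int64_t token) {
        m_tokens.push_back(token);
        // detokenize() is an assumed helper; a real implementation must also
        // guard against the decoded text temporarily shrinking.
        std::string text = m_pipe.detokenize(m_tokens);
        std::cout << text.substr(m_printed_len) << std::flush;
        m_printed_len = text.size();
    }

    void end() {
        std::cout << std::endl;
        m_tokens.clear();
        m_printed_len = 0;
    }
};
```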

CVS-132907 CVS-137920

---------

Co-authored-by: Wovchena <[email protected]>
Co-authored-by: Ilya Lavrenov <[email protected]>
Co-authored-by: Alexander Suvorov <[email protected]>
Co-authored-by: Yaroslav Tarkan <[email protected]>
Co-authored-by: Xiake Sun <[email protected]>
Co-authored-by: wenyi5608 <[email protected]>
Co-authored-by: Ekaterina Aidova <[email protected]>
Co-authored-by: guozhong wang <[email protected]>
Co-authored-by: Chen Peter <[email protected]>
@github-actions bot added the category: llm_bench label on Jun 7, 2024
@ilya-lavrenov ilya-lavrenov merged commit 26c3c40 into openvinotoolkit:releases/2024/2 Jun 10, 2024
27 checks passed
Labels
category: llm_bench Label for tool/llm_bench folder
3 participants