Parallel sampling with threadpool #1252

mzegla · 2024-11-25T10:15:11Z

This PR implements the same functionality as #1233, but in a different way. Only one of the two should be merged.

Since the pipeline logic runs on a single thread, CPU usage is low whenever the pipeline is not executing inference but running other logic, such as sampling, which can take a large fraction of the total time. Currently, after inference completes, we sample from each sequence group in a loop on a single thread. This becomes a bottleneck with sampling parameters that significantly extend the sampling time for a single sequence group.

This PR extracts the sampling logic for a single sequence group into a separate method that can be executed independently of any other sequence group. It includes a generic thread pool implementation that spawns a set number of threads, which are used to run the sampling logic for different sequence groups in parallel.
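To illustrate the approach described above, here is a minimal sketch of a generic thread pool with per-group sampling tasks submitted to it. It is not the code from this PR: `ThreadPool`, `submit`, `sample_group`, and `sample_all` are hypothetical names, and `sample_group` is only a greedy argmax stand-in for the real per-group sampling method, which presumably also covers multinomial and beam search.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size thread pool; not the class added in this PR.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t num_threads) {
        assert(num_threads > 0);
        for (std::size_t i = 0; i < num_threads; ++i)
            m_workers.emplace_back([this] { worker_loop(); });
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_stop = true;
        }
        m_cv.notify_all();
        for (auto& worker : m_workers)
            worker.join();
    }

    // Queue a callable and return a future for its result.
    template <typename F>
    auto submit(F&& func) -> std::future<decltype(func())> {
        using Result = decltype(func());
        auto task = std::make_shared<std::packaged_task<Result()>>(std::forward<F>(func));
        std::future<Result> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_tasks.emplace([task] { (*task)(); });
        }
        m_cv.notify_one();
        return result;
    }

private:
    // Each worker pops tasks until the pool is stopped and the queue is drained.
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cv.wait(lock, [this] { return m_stop || !m_tasks.empty(); });
                if (m_stop && m_tasks.empty())
                    return;
                task = std::move(m_tasks.front());
                m_tasks.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }

    std::vector<std::thread> m_workers;
    std::queue<std::function<void()>> m_tasks;
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_stop = false;
};

// Stand-in for per-group sampling: returns the argmax of one group's logits.
std::size_t sample_group(const std::vector<float>& logits) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i)
        if (logits[i] > logits[best])
            best = i;
    return best;
}

// Submit one task per sequence group, then collect results in group order.
std::vector<std::size_t> sample_all(ThreadPool& pool,
                                    const std::vector<std::vector<float>>& groups) {
    std::vector<std::future<std::size_t>> futures;
    futures.reserve(groups.size());
    for (const auto& logits : groups)
        futures.push_back(pool.submit([&logits] { return sample_group(logits); }));

    std::vector<std::size_t> tokens;
    tokens.reserve(groups.size());
    for (auto& future : futures)
        tokens.push_back(future.get());
    return tokens;
}
```

Each task in this sketch only reads its own group's data and returns its result through a future, so the parallel section touches no shared mutable state. Any step that does need shared state would have to be synchronized or moved outside the parallel section.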

Performance measurements confirm the improvement, especially for non-greedy sampling and at high concurrency (the more sequence groups are scheduled for inference, the greater the benefit from parallel sampling).

mzegla added 4 commits November 25, 2024 11:11

extract sampling for single sequence group and call it asynchronously

c859a68

post rebase adjustments; fix finish iteration; move currently_processed_tokens update; switch to async; experimental threadpool; remove access to shared struct in parallelized code; synchronize beam search part

refactor

def87de

extended timers

fec70f1

style

0c26c92

mzegla mentioned this pull request Nov 25, 2024

Parallel sampling with ov::threading #1233

Draft

github-actions bot added the category: continuous batching, category: sampling, and no-match-files labels Nov 25, 2024

ilya-lavrenov self-assigned this Nov 26, 2024

ilya-lavrenov added this to the 2025.0 milestone Dec 4, 2024
