Parallel sampling with ov::threading #1233
base: master
bool stop = false;

public:
ThreadPool(size_t num_threads = std::thread::hardware_concurrency())
Typically, in OV we optimize with parallel_for or similar functions: https://github.com/openvinotoolkit/openvino/blob/master/src/core/include/openvino/core/parallel.hpp
This thread pool is used in a loop where the next iteration uses values computed in the previous one, and a task scheduled on another thread needs those values. I'm not familiar with TBB development, so correct me if I'm wrong, but isn't such a scenario a problem for parallel_for? Doesn't each iteration need to be completely independent of the others?
post rebase adjustments
fix finish iteration
move currently_processed_tokens update
switch to async experimental threadpool
remove access to shared struct in parallelized code
synchronize beam search part
This PR implements the same functionality as #1252, but in a different manner. Only one of them should be merged.
Because the pipeline logic runs on a single thread, CPU usage is low whenever the pipeline is not executing inference but other logic such as sampling, which can take a large fraction of the time. Currently, after inference finishes, we sample from each sequence group in a loop on a single thread. This becomes a problem with sampling parameters that significantly extend the sampling time for a single sequence group.
This PR extracts the sampling logic for a single sequence group into a separate method that can be executed independently of any other sequence group. It includes a generic thread pool implementation that spawns a certain number of threads, which run the sampling logic for different sequence groups in parallel.
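The approach can be sketched roughly as follows. This is a minimal illustration built only on the standard library, not the PR's actual code: the `submit` method, `sample_group`, and `sample_all` are hypothetical names, and each sequence group is stood in for by an `int`. Only the `stop` flag and the constructor signature mirror the fragment quoted in the review.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size thread pool (illustrative sketch only).
class ThreadPool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex mtx;
    std::condition_variable cv;
    bool stop = false;

public:
    explicit ThreadPool(size_t num_threads = std::thread::hardware_concurrency()) {
        if (num_threads == 0) num_threads = 1;  // hardware_concurrency() may return 0
        for (size_t i = 0; i < num_threads; ++i) {
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(mtx);
                        cv.wait(lock, [this] { return stop || !tasks.empty(); });
                        if (stop && tasks.empty()) return;  // drain the queue before exiting
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();  // run outside the lock
                }
            });
        }
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mtx);
            stop = true;
        }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }

    // Hypothetical API: enqueue a callable and return a future for its result.
    template <typename F>
    auto submit(F&& f) -> std::future<decltype(f())> {
        using R = decltype(f());
        auto task = std::make_shared<std::packaged_task<R()>>(std::forward<F>(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mtx);
            tasks.emplace([task] { (*task)(); });
        }
        cv.notify_one();
        return result;
    }
};

// Hypothetical stand-in for sampling a single sequence group.
int sample_group(int group_id) { return group_id * 2; }

// Submit one task per group, then collect the results in submission order.
std::vector<int> sample_all(ThreadPool& pool, const std::vector<int>& group_ids) {
    std::vector<std::future<int>> futures;
    for (int id : group_ids)
        futures.push_back(pool.submit([id] { return sample_group(id); }));
    std::vector<int> results;
    for (auto& f : futures) results.push_back(f.get());
    assert(results.size() == group_ids.size());
    return results;
}
```

Collecting the futures in submission order keeps the output deterministic regardless of which worker finishes first. It relies on each task touching only its own group, which matches the "executed independently" requirement described above.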
Performance measurements confirm an improvement, especially for non-greedy sampling and at high concurrency (the more sequence groups are scheduled for inference, the greater the benefit from parallel sampling).