Rewrite job pool to support nested parallelism for improved scalability #339
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Before this change, given N images to encode, the caller had two options:
basis_compress
for each image; internally this API would use multiple threads (when possible) to process large images on multiple threadsbasis_parallel_compress
for the group of images; internally this API would encode each image on one threadNeither of these would be optimal. For example, given 4 4K images and a 16-core system,
basis_parallel_compress
is suboptimal because it would only use 4 out of 16 cores for encoding;basis_compress
would be better, but it still would be suboptimal because encoding of a large image doesn't scale perfectly and there would be gaps of inactivity where only one core would do the work (notably, decoding the image from PNG inputs or rescaling to compute mips). Also, scaling would decrease in efficiency for smaller mips even for one image.To fully saturate the system, you could use
basis_compress
and call it from multiple threads. This solves the problem of having enough concurrent work, but results in overhead due to context switches/thread preemption. It's difficult to size the two pools required in this case (outer and inner) wrt how many threads each has - especially if the goal is to reach a given level of total parallelism that is less than the system provides (e.g.-j16
on a 24-thread system) to avoid system wide performance issues due to oversubscription.For this to work well we ideally need to have basis_parallel_compress use a single job pool - both for "outer" processing, handling the flow of image decoding, and for "inner" processing (threading the work required for one image). This would get us the best of both worlds - a predictable total system fanout limited by the job pool capacity, and a full saturation of the said pool.
For this to work well, we need to change the mechanics of job dispatch. Notably, when we wait for the jobs to finish, it's important that this only interacts with the jobs launched for the current image - both in terms of which jobs we wait for, and which jobs we help process. Without this there's going to be cross-talk between individual "outer" jobs which can result in poor scalability.
To solve this, the job_pool API gets augmented with optional tokens - these are just outstanding job counters, and the pointer to the counter is saved in the job struct, which allows us to selectively steal only jobs relevant for a given token. The token is kept optional just in case simpler uses of this API are interesting elsewhere in this project, although all image encoding really should use tokens to work with nested parallelism - forgetting it in the nested calls may lead to deadlocks as two running jobs would wait for each other with no forward progress.
Note that all operations on tokens are protected by the mutex so we don't need to use atomics. In fact, the use of atomics in the old code contained a bug that lead to a rare deadlock on shutdown, see https://cohost.org/zeux/post/520125-condition-variables.
This code has been extensively tested as part of gltfpack. Performance numbers on an example glTF model with 12 2K textures using ETC1S encoding on AMD Ryzen 5900X (12-core/24-thread system):
basis_compress
one image at a time, no threading: 44sbasis_compress
one image at a time: 12.5sbasis_parallel_compress
before this change: 9.0sbasis_parallel_compress
after this change: 4.2s(this change was tested on a variety of models and should never degrade performance and almost always result in an improvement, the above is just an example - notably, even though all textures there are the same size, they don't take the same amount of time to compress, so even with 12 serially compressed textures which is what
basis_parallel_compress
did by default, we have a very significant improvement in overall time with the new code)