Currently, the kernel code generation is bound to a specific workload size (dimensions) here. As a result, whenever the dimensions change, the kernel code is regenerated and recompiled.
@gabriellaraujo1903 reported a substantial overhead of this behavior:
When the workload size changes, GSParLib recompiles the GPU kernel.
For instance, suppose we execute a vector sum where the vector's size is 10,000; then we run another vector sum where the vector's size is 50,000. In this case, GSParLib will recompile the GPU kernel.
This behavior degrades performance when a GPU kernel is executed many times and the workload size changes continuously.
This occurs in the MG program from NPB, an iterative program where the GPU kernels are called thousands of times and the workload varies continuously. Recompiling in this case imposes a considerable performance penalty; GPU execution time can even be worse than that of the serial code.
On the other hand, CUDA does not require GPU kernel recompilation when the workload size changes. If I remember correctly, only batching would require recompilation.
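For reference, a plain CUDA kernel takes the workload size as an ordinary runtime parameter, so the same compiled kernel serves any vector length. This is an illustrative sketch (the `vecAdd` kernel and launch configuration are made up for this example, not GSParLib code):

```cuda
// Illustrative CUDA kernel: the vector length n is a runtime argument,
// not baked into the generated code.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // runtime bounds check covers any n
        c[i] = a[i] + b[i];
}

// The same compiled kernel handles both workloads from the example above;
// only the launch configuration changes, with no recompilation:
//   vecAdd<<<(10000 + 255) / 256, 256>>>(a, b, c, 10000);
//   vecAdd<<<(50000 + 255) / 256, 256>>>(a, b, c, 50000);
```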
This issue aims to avoid recompiling the kernel when the workload changes. We need to investigate whether the generated code can be reused when the workload size changes.
The workload is passed as an argument in the kernel launch, so perhaps we only need to remove the extra compilation step. The code probably still needs to be recompiled when dimensions (x, y, z) are added or removed; the main aim of this issue is to avoid recompiling when only the workload size changes.