This is a minimalistic implementation of the BLS aggregation algorithm running on a GPU (currently only BLS12-381). It is designed to be used in a high-throughput setting, where many aggregation operations are performed in parallel.
The goal is to aggregate a batch of 64k public keys every 0.5ms in a streaming fashion.
In order to achieve high throughput, we make use of a 4 stage pipeline with the following stages:
- stage 1: copy
num_points
public keys from the host to the GPU global memory (h_2_d) - stage 2: launch
log2(num_points / num_results)
addition kernels that repeatedly halve the points tonum_results
results (add) - stage 3: copy the final results from the GPU global memory to the host (d_2_h)
- stage 4: reduce the final results on the host
To accomodate the pipeline we partition the GPU memory into two regions (A and B), where each region handels alternating batches of public keys. Here is a visualization of the process: