
# High throughput BLS aggregation

## What is this?

This is a minimalistic implementation of the BLS aggregation algorithm running on a GPU (currently only BLS12-381). It is designed to be used in a high-throughput setting, where many aggregation operations are performed in parallel.

The goal is to aggregate a batch of 64k public keys every 0.5ms in a streaming fashion.
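At that rate, a batch of 65,536 keys every 0.5 ms works out to roughly 131 million public keys aggregated per second.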

## How does it work?

To achieve high throughput, we use a 4-stage pipeline:

- stage 1: copy `num_points` public keys from the host to GPU global memory (`h_2_d`)
- stage 2: launch `log2(num_points / num_results)` addition kernels that repeatedly halve the points down to `num_results` partial results (`add`); a kernel sketch follows this list
- stage 3: copy the final results from GPU global memory back to the host (`d_2_h`)
- stage 4: reduce the final results on the host
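The halving in stage 2 maps naturally onto one GPU thread per output point. Below is a minimal sketch of such a kernel; `G1Point` and `g1_add` are placeholders standing in for the real BLS12-381 Jacobian point type and its field arithmetic, not this repository's actual definitions.

```cuda
#include <cuda_runtime.h>

// Placeholder point type: a real BLS12-381 G1 point in Jacobian coordinates
// holds three 381-bit field elements (X, Y, Z), i.e. 6 x 64-bit limbs each.
struct G1Point {
    unsigned long long limbs[18];
};

// Placeholder for the full Jacobian point addition over Fp; the limb-wise
// sum below is NOT curve arithmetic, it only keeps the sketch compilable.
__device__ G1Point g1_add(const G1Point &a, const G1Point &b) {
    G1Point r;
    for (int i = 0; i < 18; ++i) r.limbs[i] = a.limbs[i] + b.limbs[i];
    return r;
}

// One launch halves the point count: thread i adds points 2i and 2i+1.
// Launched log2(num_points / num_results) times, this reduces num_points
// inputs down to num_results partial sums.
__global__ void add_kernel(const G1Point *in, G1Point *out, int n_out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_out) {
        out[i] = g1_add(in[2 * i], in[2 * i + 1]);
    }
}
```

With 64k inputs and, for example, 128 results per batch, this kernel would be launched `log2(65536 / 128) = 9` times.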

To accommodate the pipeline, we partition the GPU memory into two regions (A and B), where each region handles alternating batches of public keys.
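Below is a sketch of how the two regions could drive the pipeline, reusing the placeholder `G1Point` and `add_kernel` from above. The two CUDA streams, buffer layout, and ping-pong scheme are illustrative assumptions, not the repository's actual code; for the copies to overlap with kernel execution, `h_in` and `h_out` would need to be pinned host memory (`cudaMallocHost`).

```cuda
#include <utility>  // std::swap

// Process one batch on one stream; d_in and d_tmp belong to one region.
void run_batch(const G1Point *h_in, G1Point *h_out,
               G1Point *d_in, G1Point *d_tmp,
               cudaStream_t s, int num_points, int num_results) {
    // stage 1: h_2_d
    cudaMemcpyAsync(d_in, h_in, num_points * sizeof(G1Point),
                    cudaMemcpyHostToDevice, s);
    // stage 2: repeated halving, ping-ponging between the two buffers
    // so that no launch reads and writes the same locations
    G1Point *src = d_in, *dst = d_tmp;
    for (int n = num_points; n > num_results; n /= 2) {
        int n_out = n / 2;
        int threads = 256;
        int blocks = (n_out + threads - 1) / threads;
        add_kernel<<<blocks, threads, 0, s>>>(src, dst, n_out);
        std::swap(src, dst);
    }
    // stage 3: d_2_h (the last halving left the results in src)
    cudaMemcpyAsync(h_out, src, num_results * sizeof(G1Point),
                    cudaMemcpyDeviceToHost, s);
}

// Alternate batches between region A (index 0) and region B (index 1):
// batch b+1's copy can overlap batch b's kernels on the other stream.
void aggregate_all(const G1Point *h_in, G1Point *h_out,
                   G1Point *d_in[2], G1Point *d_tmp[2],
                   cudaStream_t stream[2],
                   int num_points, int num_results, int num_batches) {
    for (int b = 0; b < num_batches; ++b) {
        int r = b & 1;  // 0 = region A, 1 = region B
        run_batch(h_in + (size_t)b * num_points,
                  h_out + (size_t)b * num_results,
                  d_in[r], d_tmp[r], stream[r], num_points, num_results);
        // stage 4 (not shown): once stream[r] finishes a batch, the host
        // reduces its num_results partial sums into the final aggregate.
    }
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
}
```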
