High throughput BLS aggregation

What is this?

This is a minimal GPU implementation of BLS public key aggregation (currently BLS12-381 only). It is designed for high-throughput settings, where many aggregation operations are performed in parallel.

The goal is to aggregate a batch of 64k public keys every 0.5ms in a streaming fashion.
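To put that goal in numbers, here is a rough back-of-the-envelope calculation. It assumes that "64k" means 65,536 keys and that each key is transferred as an uncompressed affine G1 point of 96 bytes (two 48-byte field elements). Neither assumption is confirmed by this README; the actual on-device representation may differ.

```python
# Back-of-the-envelope numbers for the stated throughput goal.
BATCH_SIZE = 64 * 1024      # assumption: "64k" read as 65,536 public keys
BATCH_INTERVAL_S = 0.5e-3   # one batch every 0.5 ms (from the goal above)
POINT_BYTES = 96            # assumption: uncompressed affine G1 point, 2 x 48 bytes

keys_per_second = BATCH_SIZE / BATCH_INTERVAL_S
batch_bytes = BATCH_SIZE * POINT_BYTES
h2d_bytes_per_second = batch_bytes / BATCH_INTERVAL_S

print(f"keys per second:        {keys_per_second:,.0f}")
print(f"bytes per batch:        {batch_bytes:,}")
print(f"host-to-device traffic: {h2d_bytes_per_second / 1e9:.2f} GB/s")
```

Under these assumptions the goal works out to roughly 131 million keys per second and about 12.6 GB/s of host-to-device traffic, which suggests why overlapping the copies with the computation matters.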

How does it work?

To achieve high throughput, we use a four-stage pipeline:

stage 1: copy num_points public keys from the host to the GPU global memory (h_2_d)
stage 2: launch log2(num_points / num_results) addition kernels, each of which halves the number of remaining points, until num_results partial results are left (add)
stage 3: copy the final results from the GPU global memory to the host (d_2_h)
stage 4: reduce the final results on the host
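The halving in stage 2 and the final reduction in stage 4 can be sketched on the host as follows. This is only an illustration of the reduction pattern, not the repository's kernel code: integers modulo a prime stand in for BLS12-381 curve points, the helper names (halve, aggregate, add_mod) are hypothetical, the pairing of element i with element i + half is one possible choice, and num_points / num_results is assumed to be a power of two.

```python
import math

def halve(points, add):
    """Model of one 'add' kernel launch: combine element i with element i + half."""
    half = len(points) // 2
    return [add(points[i], points[i + half]) for i in range(half)]

def aggregate(points, num_results, add):
    """Halve down to num_results partial sums (stage 2), then finish sequentially (stage 4)."""
    num_launches = int(math.log2(len(points) // num_results))
    for _ in range(num_launches):
        points = halve(points, add)
    assert len(points) == num_results
    total = points[0]
    for p in points[1:]:
        total = add(total, p)
    return total, num_launches

# Stand-in group: integers modulo a prime (assumption for illustration only).
P = 2**61 - 1

def add_mod(a, b):
    return (a + b) % P

keys = list(range(1, 1025))  # 1024 fake "public keys"
total, launches = aggregate(keys, num_results=16, add=add_mod)
print("kernel launches:", launches)
print("matches plain sum:", total == sum(keys) % P)
```

With 1024 points and 16 results the sketch performs log2(1024 / 16) = 6 halving rounds, leaving 16 partial sums for the host to add up.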

To accommodate the pipeline, we partition the GPU memory into two regions (A and B), where each region handles alternating batches of public keys. Here is a visualization of the process:

[Figure: visualization of the pipeline]
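The alternating assignment of batches to the two regions can be sketched as below. The function names and the layout (both regions placed back to back in a single device allocation) are assumptions made for illustration and are not taken from the repository. The point of the scheme is that consecutive batches never share a region, so one batch can presumably be copied in while the previous one is still being reduced.

```python
REGIONS = ("A", "B")

def region_for(batch_index):
    """Even batches use region A, odd batches use region B."""
    return REGIONS[batch_index % len(REGIONS)]

def region_offset(batch_index, region_bytes):
    """Byte offset of the batch's region inside one contiguous allocation (assumed layout)."""
    return (batch_index % len(REGIONS)) * region_bytes

REGION_BYTES = 64 * 1024 * 96  # assumption: 65,536 points of 96 bytes each

for batch in range(6):
    print(batch, region_for(batch), region_offset(batch, REGION_BYTES))
```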

rafalum/bls_cuda
