diff --git a/docs/icicle/image.png b/docs/icicle/image.png
new file mode 100644
index 0000000..9e6aeca
Binary files /dev/null and b/docs/icicle/image.png differ
diff --git a/docs/icicle/multi-gpu.md b/docs/icicle/multi-gpu.md
new file mode 100644
index 0000000..fd7e2fd
--- /dev/null
+++ b/docs/icicle/multi-gpu.md
@@ -0,0 +1,64 @@
+# Multi GPU with ICICLE
+
+:::info
+
+If you are looking for the Multi GPU API documentation, refer to the [Rust bindings](./rust-bindings/multi-gpu.md).
+
+:::
+
+One common challenge in Zero-Knowledge computation is managing large input sizes. It's not uncommon to encounter circuits surpassing 2^25 constraints, and such large inputs push the capabilities of even advanced GPUs to their limits. To scale and process such large circuits effectively, leveraging multiple GPUs in tandem becomes a necessity.
+
+Multi-GPU programming involves developing software to operate across multiple GPU devices. Let's first explore different approaches to multi-GPU programming, then cover how ICICLE allows you to easily develop your ZK computations to run across many GPUs.
+
+
+## Approaches to Multi GPU programming
+
+There are many [different strategies](https://github.com/NVIDIA/multi-gpu-programming-models) for implementing multi-GPU software; however, they can be split into two categories.
+
+### GPU Server approach
+
+This approach usually involves one or more CPUs opening threads to read from and write to multiple GPUs. You can think of it as a scaled-up Host-Device model.
+
+![GPU server approach](image.png)
+
+This approach won't let us tackle larger computation sizes, but it will allow us to run in parallel multiple computations that we wouldn't be able to load onto a single GPU together.
+
+For example, say you had to compute two MSMs of size 2^20 on a 16GB VRAM GPU: you would normally have to perform them one after the other. However, if you double the number of GPUs in your system, you can run them in parallel.
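To make the throughput argument concrete, here is a minimal plain-Rust sketch of the pattern (no CUDA involved — the hypothetical `run_msm_on_device` helper stands in for "bind this thread to a device and run one MSM there"; in real ICICLE code it would call `set_device` and then the MSM):

```rust
use std::thread;

// Hypothetical stand-in for "bind this thread to `device_id` and run one MSM there".
// In real ICICLE code this would call set_device(device_id) and then msm(...).
fn run_msm_on_device(device_id: usize) -> String {
    format!("MSM of size 2^20 finished on device {}", device_id)
}

fn main() {
    // With one GPU these two tasks would have to run one after the other;
    // with two GPUs each gets its own thread and they run in parallel.
    let h0 = thread::spawn(|| run_msm_on_device(0));
    let h1 = thread::spawn(|| run_msm_on_device(1));

    println!("{}", h0.join().unwrap());
    println!("{}", h1.join().unwrap());
}
```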
+
+
+### Inter GPU approach
+
+This is a more sophisticated approach to multi-GPU computation. Using technologies such as [GPUDirect, NCCL, NVSHMEM](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-cwes1084/) and NVLink, it's possible to combine multiple GPUs and split a computation among the different devices.
+
+This approach requires redesigning the algorithm at the software level so that it can be split amongst devices. In some cases, to keep latency to a minimum, special inter-GPU connections are installed on a server to allow direct communication between multiple GPUs.
+
+
+## Writing ICICLE Code for Multi GPUs
+
+The approach we have taken for the moment is the GPU server approach; we assume you have a machine with multiple GPUs and you wish to run some computation on each GPU.
+
+To dive deeper and learn about the API, check out the docs for our different ICICLE APIs:
+
+- [Rust Multi GPU APIs](./rust-bindings/multi-gpu.md)
+- C++ Multi GPU APIs
+
+
+## Best practices
+
+- Never hardcode device IDs. If you want your software to take advantage of all GPUs on a machine, use methods such as `get_device_count` to support an arbitrary number of GPUs.
+
+- Launch one thread per GPU. To avoid nasty errors and hard-to-read code, we suggest creating a dedicated thread for every GPU task you wish to launch. This will make your code more manageable, easier to read, and performant.
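Both practices can be sketched together in plain Rust (runnable without a GPU: the `DEVICE_COUNT` constant stands in for the value `get_device_count` would return, and the closure body is where `set_device` and the per-device task would go):

```rust
use std::thread;

// Illustrative stand-in: on a real machine this value would come from get_device_count().
const DEVICE_COUNT: usize = 4;

// One dedicated thread per device; in real ICICLE code the closure would call
// set_device(device_id) first and then run that device's GPU task.
fn run_on_all_devices() -> Vec<usize> {
    let handles: Vec<_> = (0..DEVICE_COUNT)
        .map(|device_id| {
            thread::spawn(move || {
                // placeholder "result" of this device's work
                device_id * device_id
            })
        })
        .collect();

    // Join every worker and collect the per-device results in launch order.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    println!("{:?}", run_on_all_devices()); // prints [0, 1, 4, 9]
}
```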
+
+## ZKContainer support for multi GPUs
+
+Multi-GPU support works with ZKContainers by simply defining which devices the docker container should interact with:
+
+```sh
+docker run -it --gpus '"device=0,2"' zk-container-image
+```
+
+If you wish to expose all GPUs:
+
+```sh
+docker run --gpus all zk-container-image
+```
diff --git a/docs/icicle/rust-bindings/multi-gpu.md b/docs/icicle/rust-bindings/multi-gpu.md
new file mode 100644
index 0000000..428f2af
--- /dev/null
+++ b/docs/icicle/rust-bindings/multi-gpu.md
@@ -0,0 +1,199 @@
+# Multi GPU APIs
+
+To learn more about the theory of multi-GPU programming, refer to [this part](../multi-gpu.md) of the documentation.
+
+Here we will cover the core multi-GPU APIs and an [example](#a-multi-gpu-example).
+
+## Device management API
+
+To streamline device management, we offer methods for dealing with devices as part of the `icicle-cuda-runtime` package.
+
+#### [`set_device`](https://github.com/vhnatyk/icicle/blob/275eaa99040ab06b088154d64cfa50b25fbad2df/wrappers/rust/icicle-cuda-runtime/src/device.rs#L6)
+
+Sets the current CUDA device by its ID. Calling `set_device` binds the calling thread to the specified CUDA device.
+
+**Parameters:**
+
+- `device_id: usize`: The ID of the device to set as the current device. Device IDs start from 0.
+
+**Returns:**
+
+- `CudaResult<()>`: An empty result indicating success if the device is set successfully. In case of failure, returns a `CudaError`.
+
+**Errors:**
+
+- Returns a `CudaError` if the specified device ID is invalid or if a CUDA-related error occurs during the operation.
+
+**Example:**
+
+```rust
+let device_id = 0; // Device ID to set
+match set_device(device_id) {
+    Ok(()) => println!("Device set successfully."),
+    Err(e) => eprintln!("Failed to set device: {:?}", e),
+}
+```
+
+#### [`get_device_count`](https://github.com/vhnatyk/icicle/blob/275eaa99040ab06b088154d64cfa50b25fbad2df/wrappers/rust/icicle-cuda-runtime/src/device.rs#L10)
+
+Retrieves the number of CUDA devices available on the machine.
+
+**Returns:**
+
+- `CudaResult<usize>`: The number of available CUDA devices. On success, contains the count of CUDA devices. On failure, returns a `CudaError`.
+
+**Errors:**
+
+- Returns a `CudaError` if a CUDA-related error occurs during the retrieval of the device count.
+
+**Example:**
+
+```rust
+match get_device_count() {
+    Ok(count) => println!("Number of devices available: {}", count),
+    Err(e) => eprintln!("Failed to get device count: {:?}", e),
+}
+```
+
+#### [`get_device`](https://github.com/vhnatyk/icicle/blob/275eaa99040ab06b088154d64cfa50b25fbad2df/wrappers/rust/icicle-cuda-runtime/src/device.rs#L15)
+
+Retrieves the ID of the current CUDA device.
+
+**Returns:**
+
+- `CudaResult<usize>`: The ID of the current CUDA device. On success, contains the device ID. On failure, returns a `CudaError`.
+
+**Errors:**
+
+- Returns a `CudaError` if a CUDA-related error occurs during the retrieval of the current device ID.
+
+**Example:**
+
+```rust
+match get_device() {
+    Ok(device_id) => println!("Current device ID: {}", device_id),
+    Err(e) => eprintln!("Failed to get current device: {:?}", e),
+}
+```
+
+## Device context API
+
+The `DeviceContext` is embedded into `NTTConfig`, `MSMConfig` and `PoseidonConfig`, meaning you can simply pass a `device_id` to your existing config and the same computation will be triggered on a different device automatically.
+
+#### [`DeviceContext`](https://github.com/vhnatyk/icicle/blob/eef6876b037a6b0797464e7cdcf9c1ecfcf41808/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L11)
+
+Represents the configuration of a CUDA device, encapsulating the device's stream, ID, and memory pool. The default device is always `0`, unless configured otherwise.
+
+```rust
+pub struct DeviceContext<'a> {
+    pub stream: &'a CudaStream,
+    pub device_id: usize,
+    pub mempool: CudaMemPool,
+}
+```
+
+##### Fields
+
+- **`stream: &'a CudaStream`**
+
+  A reference to a `CudaStream`. This stream is used for executing CUDA operations. By default, it points to the null stream, CUDA's default execution stream.
+
+- **`device_id: usize`**
+
+  The index of the GPU currently in use. The default value is `0`, indicating the first GPU in the system.
+
+- **`mempool: CudaMemPool`**
+
+  Represents the memory pool used for CUDA memory allocations. The default is a null pointer, which signifies the use of the default CUDA memory pool.
+
+##### Implementation Notes
+
+- The `DeviceContext` structure is cloneable and can be debugged, facilitating easier logging and duplication of contexts when needed.
+
+
+#### [`DeviceContext::default_for_device(device_id: usize) -> DeviceContext<'static>`](https://github.com/vhnatyk/icicle/blob/eef6876b037a6b0797464e7cdcf9c1ecfcf41808/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L30C12-L30C30)
+
+Provides a default `DeviceContext` for a given device, ideal for straightforward setups.
+
+#### Parameters
+
+- **`device_id: usize`**: The ID of the device for which to create the context.
+
+#### Returns
+
+A `DeviceContext` instance configured with:
+- The provided `device_id`.
+- The default stream (`null_mut()`).
+- The default memory pool (`null_mut()`).
+
+
+#### [`check_device(device_id: i32)`](https://github.com/vhnatyk/icicle/blob/eef6876b037a6b0797464e7cdcf9c1ecfcf41808/wrappers/rust/icicle-cuda-runtime/src/device_context.rs#L42)
+
+Validates that the specified `device_id` matches the ID of the currently active device, ensuring operations are targeted correctly.
+
+#### Parameters
+
+- **`device_id: i32`**: The device ID to verify against the currently active device.
+
+#### Behavior
+
+- **Panics** if the `device_id` does not match the active device's ID, preventing cross-device operation errors.
+
+#### Example
+
+```rust
+let device_id: i32 = 0; // Example device ID
+check_device(device_id);
+// Ensures that the current context is correctly set for the specified device ID.
+```
+
+
+## A Multi GPU example
+
+In this example we will show how you can:
+
+1. Fetch the number of devices installed on a machine.
+2. Launch a thread for every GPU and set an active device per thread.
+3. Execute an MSM on each GPU.
+
+
+
+```rust
+
+...
+
+let device_count = get_device_count().unwrap();
+
+(0..device_count)
+    .into_par_iter()
+    .for_each(move |device_id| {
+        set_device(device_id).unwrap();
+
+        // you can allocate points and scalars_d here;
+        // `stream`, `points`, `scalars_d` and `msm_results` are elided in this snippet
+
+        let mut cfg = MSMConfig::default_for_device(device_id);
+        cfg.ctx.stream = &stream;
+        cfg.is_async = true;
+        cfg.are_scalars_montgomery_form = true;
+        msm(&scalars_d, &HostOrDeviceSlice::on_host(points), &cfg, &mut msm_results).unwrap();
+
+        // collect and process results
+    });
+
+...
+```
+
+
+We use `get_device_count` to fetch the number of connected devices; device IDs range from `0` to `device_count - 1`.
+
+[`into_par_iter`](https://docs.rs/rayon/latest/rayon/iter/trait.IntoParallelIterator.html#tymethod.into_par_iter) is a parallel iterator; you should expect it to launch a thread for every iteration.
+
+We then call `set_device(device_id).unwrap();` to set the context of that thread to the selected `device_id`.
+
+Any data you now allocate from the context of this thread will be linked to the `device_id`. We create our `MSMConfig` with the selected device ID, `let mut cfg = MSMConfig::default_for_device(device_id);`; behind the scenes this creates a `DeviceContext` configured for that specific GPU.
+
+We finally call our `msm` method.
diff --git a/sidebars.js b/sidebars.js
index f5e487f..eae710f 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -30,9 +30,20 @@ module.exports = {
       id: "icicle/golang-bindings",
     },
     {
-      type: "doc",
+      type: "category",
       label: "Rust bindings",
-      id: "icicle/rust-bindings",
+      link: {
+        type: `doc`,
+        id: "icicle/rust-bindings",
+      },
+      collapsed: true,
+      items: [
+        {
+          type: "doc",
+          label: "Multi GPU Support",
+          id: "icicle/rust-bindings/multi-gpu",
+        }
+      ]
     },
     {
       type: "category",
@@ -60,6 +71,11 @@ module.exports = {
      }
    ],
  },
+  {
+    type: "doc",
+    label: "Multi GPU Support",
+    id: "icicle/multi-gpu",
+  },
  {
    type: "doc",
    label: "Supporting additional curves",