Environment Variables

Arun Karthik edited this page Dec 11, 2024 · 6 revisions

Various aspects of the plugin can be configured at runtime using the following environment variables.

| Parameter | Description | Type | Accepted Value |
| --- | --- | --- | --- |
| `OFI_NCCL_USE_IPV6_TCP` | Allow endpoints that use the IPv6 addressing format with the TCP provider. A preferred Libfabric provider can be selected with the `FI_PROVIDER` environment variable. | Boolean | 0/1 (Default: 0) |
| `OFI_NCCL_EXCLUDE_TCP_IF` | List of interface names to exclude when using the TCP provider. A preferred Libfabric provider can be selected with the `FI_PROVIDER` environment variable. | String | Comma-separated list of interface names (Default: "lo,docker0") |
| `OFI_NCCL_GDR_FLUSH_DISABLE` | Disable the flush operation when using GPUDirect. | Boolean | 0/1 (Default: 0) |
| `OFI_NCCL_DISABLE_DMABUF` | Disable DMABUF support. | Boolean | 0/1 (Default: 1) |
| `OFI_NCCL_NIC_DUP_CONNS` | Set the number of NIC connections. This is used to increase hardware utilization. Applicable to P3dn when using fewer than 8 GPUs. | Integer | x, to set x connections. Only takes effect for values greater than 0 (Default: 0) |
| `OFI_NCCL_CUDA_FLUSH_ENABLE` | When using GPUDirect, use `cudaDeviceFlushGPUDirectRDMAWrites` to enforce data consistency at the receiving GPU. Requires CUDA 11.3 or later. Note that this function only provides a GPU memory fence and requires that data has already been delivered to GPU memory. Some networks and PCIe configurations require an additional network-level flush that is not provided by this option. | Boolean | 0/1 (Default: 0) |
| `OFI_NCCL_CQ_READ_COUNT` | Adjust the maximum number of completion entries that are read in a single Libfabric polling loop. In general, users should not have to adjust this value. An array of completion queue entry structures is created on the stack, so large values of this parameter (over 16-32) may cause stack overflows. | Integer | Default: 4 |
| `OFI_NCCL_PROTOCOL` | Protocol to use for implementing send/recv operations. The default is `SENDRECV`, which uses the Libfabric tagged send/recv interface. This implementation gives the best performance on hardware that implements tagged sends natively, and likely on most Libfabric implementations that include an eager send optimization for GPU buffers. The other valid option is `RDMA`, which implements a sender-managed receive queue using RDMA write operations and supports multi-rail channels per GPU. The `RDMA` protocol is likely to work better than `SENDRECV` on networks that do not have an eager optimization or that have multiple NICs per GPU. | String | Default: SENDRECV |
| `OFI_NCCL_MIN_STRIPE_SIZE` | Adjust the maximum size of `RDMA` protocol messages that are assigned to multi-rail channels in round-robin mode. Messages larger than the threshold are multiplexed over all channels to increase network throughput. In general, users should not have to adjust this value. A very small threshold may cause `RDMA` protocol initialization to fail, because `RDMA` protocol control messages must not be multiplexed. | Integer | Default: 256KiB |
| `OFI_NCCL_NET_LATENCY` | Internode network latency, in microseconds, reported to NCCL. | Integer | Any non-negative integer. Defaults to 0, unless the configured platform sets a specific value. |
| `OFI_NCCL_EAGER_MAX_SIZE` | Eager message size limit when using the `RDMA` protocol. Messages larger than this limit are always sent using RDMA write instead of eagerly. | Integer | Any non-negative integer, though it must be <= ROUND_ROBIN_THRESHOLD. Defaults to 8KiB. |
| `OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK` | Disable the check for required GDR support on EC2 instances. When this check is disabled, the plugin can be used without GDR support even on platforms that support GDR (P4d and later). By default, the plugin performs the check. | Boolean | 0/1 (Default: 0) |
| `OFI_NCCL_MR_KEY_SIZE` | Specify the memory registration key size in bytes when using a Libfabric provider that supports application-selected memory registration keys. | Integer | Default: 2 |
| `OFI_NCCL_MR_CACHE_DISABLE` | Disable the MR cache. The MR cache keeps track of registered memory regions, so that calling `regMr()` on the same buffer (address and size) quickly returns a previously registered MR for that buffer, avoiding redundant (and expensive) registrations with the underlying device. Disabling the MR cache makes every call to `regMr()` result in a registration with the device, so it may cause a significant performance degradation. | Boolean | 0/1 (Default: 0) |
| `OFI_NCCL_DOMAIN_PER_THREAD` | By default, the plugin creates one Libfabric domain per process. On AWS Trainium instances, it creates one domain per thread instead. This variable can override the default behavior. | Integer | -1 (default, unset): use the platform-specific configuration. 0: allocate one domain per process. 1: allocate one domain per thread. |
| `OFI_NCCL_DISABLE_NATIVE_RDMA_CHECK` | On AWS platforms, the plugin checks for native RDMA write support when using the `RDMA` protocol. This variable disables the check, allowing the `RDMA` protocol to be used even on platforms where native RDMA write is not supported (or cannot be verified to be supported). | Boolean | 0/1 (Default: 0) |
| `OFI_NCCL_RDMA_MIN_POSTED_BOUNCE_BUFFERS` | Minimum number of bounce buffers posted per endpoint. The plugin attempts to post more bounce buffers if the count dips below this threshold, allocating new bounce buffers if needed. | Integer | Default: 64 |
| `OFI_NCCL_RDMA_MAX_POSTED_BOUNCE_BUFFERS` | Maximum number of bounce buffers posted per endpoint. The plugin does not attempt to post more bounce buffers once it reaches this threshold. | Integer | Default: 128 |
| `OFI_NCCL_ERRORCHECK_MUTEX` | If non-zero, fail if a thread attempts to relock a mutex that it has already locked (used for debugging). | Boolean | Default: 1 if debugging is enabled, 0 otherwise |
| `OFI_NCCL_ENDPOINT_PER_COMM` | If zero, create one Libfabric endpoint per domain, shared across all communicators. If non-zero, create different endpoints for receive communicators connected to the same source endpoint, while using a shared completion queue. | Boolean | 0/1 (Default: 0) |
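
As a minimal sketch of how these variables might be prepared for a job, the Python helper below builds an environment dictionary and applies a few of the constraints listed above: 0/1 values for Boolean variables, the two valid `OFI_NCCL_PROTOCOL` values, and the stack-size caveat for `OFI_NCCL_CQ_READ_COUNT`. The helper name `build_plugin_env` and the warning threshold of 16 are assumptions made for this example. The plugin itself only reads the environment variables and provides no such validation layer.

```python
import os
import warnings

# Subset of the variables documented above as Boolean 0/1 (for illustration).
BOOLEAN_VARS = {
    "OFI_NCCL_USE_IPV6_TCP",
    "OFI_NCCL_GDR_FLUSH_DISABLE",
    "OFI_NCCL_CUDA_FLUSH_ENABLE",
    "OFI_NCCL_MR_CACHE_DISABLE",
}
VALID_PROTOCOLS = {"SENDRECV", "RDMA"}


def build_plugin_env(overrides, base_env=None):
    """Return a copy of base_env with validated plugin overrides applied.

    Hypothetical helper: the validation here is an assumption of this
    example, not something the plugin provides.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in overrides.items():
        value = str(value)  # environment values are always strings
        if name in BOOLEAN_VARS and value not in ("0", "1"):
            raise ValueError(f"{name} must be 0 or 1, got {value!r}")
        if name == "OFI_NCCL_PROTOCOL" and value not in VALID_PROTOCOLS:
            raise ValueError(f"{name} must be one of {sorted(VALID_PROTOCOLS)}")
        # Assumed threshold: the table warns about values "over 16-32".
        if name == "OFI_NCCL_CQ_READ_COUNT" and int(value) > 16:
            warnings.warn(f"{name}={value} is large and may overflow the stack")
        env[name] = value
    return env


env = build_plugin_env(
    {"OFI_NCCL_PROTOCOL": "RDMA", "OFI_NCCL_CQ_READ_COUNT": 8},
    base_env={},
)
print(env)
```

The resulting dictionary can then be passed as the `env` argument of `subprocess.run` when launching the job. Passing `base_env=None` starts from the current process environment instead of an empty one.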

Note: Like NCCL and Libfabric, the plugin dynamically loads its CUDA dependency, libcuda.so, at runtime, and it does not use the CUDA_HOME environment variable to locate CUDA libraries. dlopen() uses the LD_LIBRARY_PATH environment variable and then your system's default search path to find libcuda.so. This matches the behavior of NCCL and Libfabric, so that all three components find the same CUDA installation.
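
To see which libcuda.so a job is likely to pick up first, the LD_LIBRARY_PATH part of that search can be approximated in Python. This is a rough sketch only: it walks the directories in LD_LIBRARY_PATH and does not model the system default search path. The function name `find_in_ld_library_path` is a hypothetical helper, not part of the plugin.

```python
import os


def find_in_ld_library_path(libname="libcuda.so", ld_library_path=None):
    """Return the first LD_LIBRARY_PATH entry containing libname, or None.

    Approximation only: the real dlopen() search also consults the
    system default search path, which this sketch does not model.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in ld_library_path.split(":"):
        if not directory:
            # Simplification: the dynamic loader treats an empty entry as
            # the current directory; this sketch skips it instead.
            continue
        candidate = os.path.join(directory, libname)
        if os.path.isfile(candidate):
            return candidate
    return None


# Prints a path if libcuda.so is found via LD_LIBRARY_PATH, otherwise None.
print(find_in_ld_library_path())
```

A result of `None` does not mean the library is missing. It only means that libcuda.so was not found through LD_LIBRARY_PATH, and it may still be found on the system default search path.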