Nvidia target via NVRTC and Nim ↦ CUDA DSL #487

Draft · @Vindaar (Collaborator) commented Nov 29, 2024

(Note: this is a draft because it is (a) a proof of concept and (b) still depends on nimcuda for simplicity.)

Table of contents & introduction

This (draft) PR adds an experimental alternative to generating code
targeting Nvidia GPUs. Instead of relying on LLVM to generate Nvidia PTX
instructions, this PR adds 3 pieces:

  • a slightly modified version of the existing "CUDA execution helpers"
    defined in codegen_nvidia.nim:
    https://github.com/mratsim/constantine/blob/master/constantine/math_compiler/codegen_nvidia.nim
    (if we decide to go ahead with this PR, I'll merge the two. They are
    compatible, the new one just has a few extra features),
  • a helper file to initialize an NVRTC (Nvidia runtime compilation
    library) compiler and compile a string of CUDA code,
  • a DSL to generate a CUDA code string from a Nim macro.

A few words on each of these first:

CUDA execution

Starting with this as it is already present in Constantine. Once one has
a compiled CUDA kernel and wishes to execute it, in principle one needs
to:

  • allocate device memory for all arguments to be passed to the
    kernel that are not pure value types
  • memcopy the data from host to device
  • call cuLaunchKernel making sure to pass all parameters as an array
    of pointers
  • copy output data back from device to host
  • free memory

Instead of having to do this manually, we use a typed macro, which
determines the required action based on the parameters passed to it.

The basic usage looks like:

execCuda("someKernel", numBlocks, threadsPerBlock, res, inputs)

where res and inputs are tuples (to support heterogeneous types).

Arguments passed as res are treated as output parameters: they are
copied to the device and afterwards copied back into the local
identifiers.

inputs will either be passed by value or copied to the device,
depending on whether the data is a ref type or not. NOTE: Deep copying
of data structures that themselves contain references / pointers is
currently not implemented. This matters in particular if one wishes to
pass data as a struct of arrays (SoA).

In practice, in the context of the code of this PR, you don't interact
with execCuda directly; it is called for you via the NVRTC compiler
helper described in the next section.

NOTE: The parameters will be passed in the order:

  • first all elements of the res tuple in the tuple order
  • then all elements of the inputs tuple in their order

This means that, currently, your output arguments must be the first
parameters of the kernel (see the illustration below)!
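
As an illustration (the kernel name, parameter names and types here are
hypothetical), a call with two outputs and two inputs has to match a
kernel whose first two parameters are those outputs:

# Hypothetical kernel: two outputs (res1, res2) followed by two inputs
# (a, x), matching the 'all `res` first, then all `inputs`' ordering
# used by `execCuda`.
const TwoOutKernel = cstring"""
extern "C" __global__
void foo(float *res1, float *res2, float a, float *x)
{
  // ...
}
"""

# The matching call: the tuple order mirrors the kernel's parameter order.
# execCuda("foo", numBlocks, threadsPerBlock, (res1, res2), (a, x))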

NVRTC compiler helper

This is essentially the NVRTC equivalent of the NvidiaAssembler part
of the LLVM backend:

https://github.com/mratsim/constantine/blob/master/constantine/math_compiler/codegen_nvidia.nim#L501-L512

As with all CUDA work, lots of boilerplate code is required to
initialize the device, set up the compilation pipeline, call the
compiler on the CUDA string, etc. As most of this is identical in the
majority of use cases, we can automate it away. NOTE: We will likely
want to eventually add some context or config object to store, e.g.,
specific parameters to pass to the NVRTC compiler.

As an example, let's look at what the Saxpy example from the CUDA
documentation looks like for us now.

import runtime_compile
import std/strformat # for `fmt` used below

# The actual CUDA kernel code from the example
# NOTE: Compared to the Saxpy documentation example, our kernel uses
# the `out` paramater as the first argument. This is because `execCuda`
# passes the arguments in the order 'all `res`, all `inputs`'.
const Saxpy = cstring"""
extern "C" __global__
void saxpy(float *out, float a, float *x, float *y, size_t n)
{
   size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
   if (tid < n) {
      out[tid] = a * x[tid] + y[tid];
   }
}
"""

proc main =
  var nvrtc = initNvrtc(Saxpy) # init the NVRTC helper with the string
  nvrtc.compile()              # compile the code
  nvrtc.getPtx()               # get the PTX of the code

  # Configure kernel launch parameters
  nvrtc.numBlocks = 32
  nvrtc.threadsPerBlock = 128

  const n = 128 * 32
  var a = cfloat 5.1
  var
    hX: array[n, cfloat]
    hY: array[n, cfloat]
    hOut: array[n, cfloat]
  for i in 0 ..< n: # Initialize host data (fill hX and hY with our data)
    hX[i] = cfloat(i)
    hY[i] = cfloat(i * 2)

  ## Execute SAXPY.
  # `saxpy` is the name of the kernel
  # `hOut` is the output array, `a, hX, hY, n` are inputs
  nvrtc.execute("saxpy", (hOut), (a, hX, hY, n)) 
  for i in 0 ..< n:
    echo fmt"{a} * {hX[i]} + {hY[i]} = {hOut[i]}"

when isMainModule:
  main()

Clearly, most of the steps (compile, getPtx) could also just be
done as part of execute. I just haven't merged them yet.

We can see that the majority of the code is now setting up the input
data for the kernel.

Important note about CUDA library

To fully support all CUDA features using NVRTC, we need to use the
header pragma in the CUDA wrapper. See this nimcuda issue about the
problem:

SciNim/nimcuda#27

(Note: the current existing CUDA wrapper in Constantine also avoids the
header pragma. Once we switch to using our own, we'll have to make
that change and thus need the code below)

This implies that we need to know the path to the CUDA libraries at
compile time. Given that most people on Linux tend to install CUDA
outside of their package manager, the path needs to be passed to the
compiler explicitly.

The runtime_compile.nim file contains the following variables:

## Path to your CUDA installation. Currently a `strdefine`, will likely change in the future
const CudaPath {.strdefine.} = "/usr/local/cuda-12.6/targets/x86_64-linux/"

## NOTE: We need to define the paths to our CUDA installation at compile time,
## because we need to use the `{.header: ...}` pragma for the CUDA wrapper.
## We might move to a 'supply your own `nim.cfg` defining them' approach in the future.
{.passC: "-I" & CudaPath & "/include".}
{.passL: "-L" & CudaPath & "/lib -lcuda".}

You can compile a program using -d:CudaPath=<path/to/your/cuda> to set
the paths accordingly.
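
Alternatively, following the 'supply your own nim.cfg' idea mentioned
above, a minimal project-local nim.cfg entry could look like this (the
path is just an example):

# nim.cfg
-d:CudaPath=/opt/cuda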

CUDA code generator

This brings us to the most interesting part of this PR. In the example
above we simply had a string of raw CUDA code. But for anyone who tends
to write Nim, this is likely neither the most attractive nor the most
elegant solution. So instead, for the Saxpy example from above, we can
write:

const Saxpy = cuda:
  proc saxpy(res: ptr UncheckedArray[float32],
             a: float32,
             x, y: ptr UncheckedArray[float32], n: csize_t) {.global.} =
    let tid = blockIdx.x * blockDim.x + threadIdx.x
    if tid < n:
      res[tid] = a * x[tid] + y[tid]
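
The generated string then plugs into the NVRTC helper exactly like the
raw CUDA string did before. A sketch (assuming, as in this PR's
examples, that the output of the cuda macro is accepted directly by
initNvrtc, and reusing the host setup from the earlier Saxpy example):

var nvrtc = initNvrtc(Saxpy) # `Saxpy` now comes from the `cuda:` macro
nvrtc.compile()
nvrtc.getPtx()
nvrtc.numBlocks = 32
nvrtc.threadsPerBlock = 128
# `hOut`, `a`, `hX`, `hY`, `n` set up as in the earlier example
nvrtc.execute("saxpy", (hOut), (a, hX, hY, n))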

Due to the somewhat restricted nature of CUDA code anyway, the vast
majority of practical code is already supported. You're unlikely to
reach for a CUDA device for complex string handling or ref object
madness in the first place. The features you'd expect to see all work:
we can access arrays, we have saner types (UncheckedArray instead of
raw pointers), we can access the CUDA special block / thread related
variables, etc. The latter is implemented by defining dummy types in
runtime_compile.nim, which are only there to keep the Nim compiler
happy during the typed macro pass. Typical CUDA annotations like
__global__ are mapped to Nim pragmas, as you can see above.

Important Nim features that are currently not supported:

  • generics
  • implicit result variable
  • built-in Nim types seq, string, etc.
  • on device openArray
  • while loops (should be simple to add)
  • case statements (should be straightforward, but likely not very
    useful)
  • echo on device (but you can printf, see below!)
  • we'll likely want a staticFor. Constantine's is currently slightly
    broken in the macro.
  • …?

Important Nim features that do work:

  • if statements
  • for loops
  • basic Nim objects
  • passing seq[T] (for T being value types!) to a kernel (technically
    a feature of execCuda) and using seq[T] as a return type
  • you can define and use templates in the cuda macro.
    • if you need access to constants defined outside of the cuda
      macro, you can create a template with a static body accessing
      the constant. The template body will be replaced by the constant
      value (see the sketch after this list)
    • macros should work fine, too.
  • native Nim types are mapped to CUDA types (no need to use cfloat,
    cint etc)
  • you can use when statements to avoid a runtime branch
  • most standard features (array indexing, member access, address of,
    dereferencing, casting, object constructors, …)
    • caveat for object constructors: In CUDA you cannot assign a
      statically sized array from a runtime value (or C / C++ for that
      matter). So BigInt(limbs: someArray) is invalid. You'll need
      to memcpy / manually assign data. Statically known arrays work
      though.
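
A minimal sketch of the 'template with a static body' trick mentioned
above (names are illustrative and the exact spelling may differ from
the BigInt example in this PR):

# A constant defined in regular Nim code, outside of the `cuda:` macro
const Scale = 3'u32

const ScaleKernel = cuda:
  # The template's static body is evaluated at Nim compile time and the
  # resulting value is inlined into the generated CUDA code.
  template scale(): untyped =
    static: Scale

  proc scaleBy(res, x: ptr UncheckedArray[uint32], n: csize_t) {.global.} =
    let tid = blockIdx.x * blockDim.x + threadIdx.x
    if tid < n:
      res[tid] = x[tid] * scale()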

Important CUDA features currently not supported:

  • __shared__ memory (we just need to implement the pragma)
  • __syncthreads and similar functions (also just need a Nim name
    for them and then map them to their CUDA name)
  • …?

Important CUDA features that do work:

  • access to blockIdx, blockDim, threadIdx
  • __global__, __device__, __forceinline__ pragmas (via the
    equivalent Nim pragmas without the underscores)
  • inline PTX via Nim's asm statement
  • you can printf on device (obviously only use this for debugging;
    see the sketch after this list)
  • memcpy
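
As an example of device-side printf (the kernel below is purely
illustrative):

const DebugKernel = cuda:
  proc whereAmI() {.global.} =
    # printf works on device; only use this for debugging
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x)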

Notes on the cuda design

Initially I started out with an untyped macro and thought I'd just
have the Nim code be only one layer above being a pure string literal.
Essentially just mapping Nim constructs directly to fixed strings. But I
quickly realized that having a typed macro would be much better,
because we could actually access type information and use templates in
the body (as they are expanded before the typed macro is executed!).

I think it is likely possible to go one step further than the
current code and access Nim procs defined outside the cuda
macro, as long as they are in scope (and not overloaded!). With a
typed macro we can get their bodies, insert them into the CUDA context
and treat them as __device__ functions.

My main motivation for this is not really sharing lots of code
between the CUDA target and other targets. While such code sharing
could theoretically be quite beneficial, I don't think it will be very
practical: different targets most likely require very different
approaches in many details, e.g. the low level primitives using inline
PTX instructions. At a higher level, one needs different approaches
due to the trade-offs required for efficient parallelism on Nvidia
GPUs compared to a CPU.

However, what I do think would be very useful is to be able to split the
cuda macro into multiple pieces (similar to how one writes Nim macros
really). Say one cuda call for type definitions, one for some device
functions etc. But due to the typed nature, this implies all the defined
types and functions would need to be visible in a global scope, which
currently would not be the case.

Profiling Nvidia code

Although it is probably obvious, it is worth mentioning that you can
of course use an Nvidia profiler (nvprof or ncu) on Nim binaries that
use this feature.

A more complex example

For a more complex example, see the BigInt example file part of this PR.
There we implement modular addition for finite field elements, similar
to the current existing implementation for the LLVM target (using inline
PTX instructions).

It shows how one defines a type on the CUDA device, accesses a constant
from Constantine (the field modulus) using a template with a static
body, how to construct objects on device and more. You'll see that the
code essentially looks like normal host code.
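
To give a flavour (this is not the actual example from the PR, which
uses inline PTX for the carry chain and performs the modular
reduction), a plain big-integer addition kernel could look roughly
like this:

const AddKernel = cuda:
  type
    BigInt = object
      limbs: array[8, uint32] # 8 uint32 limbs, as in the benchmark below

  proc addBigInts(res, a, b: ptr UncheckedArray[BigInt], n: csize_t) {.global.} =
    let tid = blockIdx.x * blockDim.x + threadIdx.x
    if tid < n:
      var r: BigInt
      var carry = 0'u64
      for i in 0 ..< 8:
        let s = uint64(a[tid].limbs[i]) + uint64(b[tid].limbs[i]) + carry
        r.limbs[i] = uint32(s and 0xFFFFFFFF'u64)
        carry = s shr 32
      res[tid] = r # modular reduction omitted in this sketch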

Be aware, of course, that to actually achieve really high performance,
just launching lots of blocks with many threads won't give you an
O(1-10k) speedup (depending on the # of CUDA cores) over a single
thread. You'll first need to go down the rabbit hole of thinking about
memory coalescence, blocks, warps and all that… As an example, a simple
benchmark performing additions of 2^25 pairs of finite field elements
of BN254_Snarks, using a BigInt type that stores 8 uint32 limbs, leads
to only a ~10x speedup compared to a single CPU core (using our very
optimized CPU code of course). nvprof shows that the memory performance
in that case is only 12.5%, because each thread has to jump over the 8
limbs of the neighboring threads/lanes. This leads to non-coalesced
memory access and causes a massive performance penalty. I mention this
in particular because implementing a structure of arrays (SoA) layout
for the data (where we have a single BigInts type, which has one array
for limb 0, one for limb 1 and so on) is currently not supported in the
context of copying data to the device via execCuda. We need to extend
the "when and what to copy" logic in the macro first.
