Nvidia target via NVRTC and Nim ↦ CUDA DSL #487

Draft · @Vindaar (Collaborator) commented Nov 29, 2024

(Note: this is a draft because it is (a) a proof of concept and (b) still depends on nimcuda for simplicity.)

Table of contents & introduction

This (draft) PR adds an experimental alternative to generating code
targeting Nvidia GPUs. Instead of relying on LLVM to generate Nvidia PTX
instructions, this PR adds 3 pieces:

  • a slightly modified version of the existing "CUDA execution helpers"
    defined in codegen_nvidia.nim:
    https://github.com/mratsim/constantine/blob/master/constantine/math_compiler/codegen_nvidia.nim
    (if we decide to go ahead with this PR, I'll merge the two. They are
    compatible, the new one just has a few extra features),
  • a helper file to initialize an NVRTC (Nvidia runtime compilation
    library) compiler and compile a string of CUDA code,
  • a DSL to generate a CUDA code string from a Nim macro.

A few words on each of these first:

CUDA execution

Starting with this as it is already present in Constantine. Once one has
a compiled CUDA kernel and wishes to execute it, in principle one needs
to:

  • allocate device memory for all arguments to be passed to the
    kernel that are not pure value types
  • memcopy the data from host to device
  • call cuLaunchKernel making sure to pass all parameters as an array
    of pointers
  • copy output data back from device to host
  • free memory

Instead of having to do this manually, we use a typed macro, which
determines the required action based on the parameters passed to it.

The basic usage looks like:

execCuda("someKernel", numBlocks, threadsPerBlock, res, inputs)

where res and inputs are tuples (to support heterogeneous types).

Arguments passed as res are treated as output parameters: they are
copied to the device and afterwards copied back into the local
identifiers.

inputs will either be passed by value or copied to the device,
depending on whether the data is a ref type or not. NOTE: Deep copying
of data structures that themselves contain references / pointers is
currently not implemented. This matters in particular if one wishes to
pass data as a struct of arrays (SoA).

In practice, in the context of the code of this PR, you don't interact
with execCuda directly; it is called for you via the NVRTC compiler
helper described in the next section.

NOTE: The parameters will be passed in the order:

  • first all elements of the res tuple in the tuple order
  • then all elements of the inputs tuple in their order

This means that, currently, your output arguments must be the first
parameters of the kernel (see the illustration below)!
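
As an illustration (the kernel name, parameter names and types here are
hypothetical), a call with two outputs and two inputs has to match a
kernel whose first two parameters are those outputs:

# Hypothetical kernel: two outputs (res1, res2) followed by two inputs
# (a, x), matching the 'all `res` first, then all `inputs`' ordering
# used by `execCuda`.
const TwoOutKernel = cstring"""
extern "C" __global__
void foo(float *res1, float *res2, float a, float *x)
{
  // ...
}
"""

# The matching call: the tuple order mirrors the kernel's parameter order.
# execCuda("foo", numBlocks, threadsPerBlock, (res1, res2), (a, x))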

NVRTC compiler helper

This is essentially the NVRTC equivalent of the NvidiaAssembler part
of the LLVM backend:

https://github.com/mratsim/constantine/blob/master/constantine/math_compiler/codegen_nvidia.nim#L501-L512

As with all CUDA work, lots of boilerplate code is required to
initialize the device, set up the compilation pipeline, call the
compiler on the CUDA string, etc. As most of this is identical in the
majority of use cases, we can automate it away. NOTE: We will likely
want to eventually add some context or config object to store, e.g.,
specific parameters to pass to the NVRTC compiler.

As an example, let's look at what the Saxpy example from the CUDA
documentation looks like for us now.

import runtime_compile
import std/strformat # for `fmt` used below

# The actual CUDA kernel code from the example
# NOTE: Compared to the Saxpy documentation example, our kernel uses
# the `out` paramater as the first argument. This is because `execCuda`
# passes the arguments in the order 'all `res`, all `inputs`'.
const Saxpy = cstring"""
extern "C" __global__
void saxpy(float *out, float a, float *x, float *y, size_t n)
{
   size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
   if (tid < n) {
      out[tid] = a * x[tid] + y[tid];
   }
}
"""

proc main =
  var nvrtc = initNvrtc(Saxpy) # init the NVRTC helper with the string
  nvrtc.compile()              # compile the code
  nvrtc.getPtx()               # get the PTX of the code

  # Configure kernel launch parameters
  nvrtc.numBlocks = 32
  nvrtc.threadsPerBlock = 128

  const n = 128 * 32
  var a = cfloat 5.1
  var
    hX: array[n, cfloat]
    hY: array[n, cfloat]
    hOut: array[n, cfloat]
  for i in 0 ..< n: # Initialize host data (fill hX and hY with our data)
    hX[i] = cfloat(i)
    hY[i] = cfloat(i * 2)

  ## Execute SAXPY.
  # `saxpy` is the name of the kernel
  # `hOut` is the output array, `a, hX, hY, n` are inputs
  nvrtc.execute("saxpy", (hOut), (a, hX, hY, n)) 
  for i in 0 ..< n:
    echo fmt"{a} * {hX[i]} + {hY[i]} = {hOut[i]}"

when isMainModule:
  main()

Clearly, most of the steps (compile, getPtx) could also just be
done as part of execute. I just haven't merged them yet.

We can see that the majority of the code is now setting up the input
data for the kernel.

Important note about CUDA library

To fully support all CUDA features using NVRTC, we need to use the
header pragma in the CUDA wrapper. See this nimcuda issue about the
problem:

SciNim/nimcuda#27

(Note: the current existing CUDA wrapper in Constantine also avoids the
header pragma. Once we switch to using our own, we'll have to make
that change and thus need the code below)

This implies that we need to know the path to the CUDA libraries at
compile time. Given that most people on Linux tend to install CUDA
outside of their package manager, the path needs to be passed to the
compiler explicitly.

The runtime_compile.nim file contains the following variables:

## Path to your CUDA installation. Currently a `strdefine`, will likely change in the future
const CudaPath {.strdefine.} = "/usr/local/cuda-12.6/targets/x86_64-linux/"

## NOTE: We need to define the paths to our CUDA installation at compile time,
## because we need to use the `{.header: ...}` pragma for the CUDA wrapper.
## We might move to a 'supply your own `nim.cfg` defining them' approach in the future.
{.passC: "-I" & CudaPath & "/include".}
{.passL: "-L" & CudaPath & "/lib -lcuda".}

You can compile a program using -d:CudaPath=<path/to/your/cuda> to set
the paths accordingly.
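
Alternatively, following the 'supply your own nim.cfg' idea mentioned
above, a minimal project-local nim.cfg entry could look like this (the
path is just an example):

# nim.cfg
-d:CudaPath=/opt/cuda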

CUDA code generator

This brings us to the most interesting part of this PR. In the example
above we simply had a string of raw CUDA code. But for anyone who tends
to write Nim, this is likely neither the most attractive nor the most
elegant solution. So instead, for the Saxpy example from above, we can
write:

const Saxpy = cuda:
  proc saxpy(res: ptr UncheckedArray[float32],
             a: float32,
             x, y: ptr UncheckedArray[float32], n: csize_t) {.global.} =
    let tid = blockIdx.x * blockDim.x + threadIdx.x
    if tid < n:
      res[tid] = a * x[tid] + y[tid]
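
The generated string then plugs into the NVRTC helper exactly like the
raw CUDA string did before. A sketch (assuming, as in this PR's
examples, that the output of the cuda macro is accepted directly by
initNvrtc, and reusing the host setup from the earlier Saxpy example):

var nvrtc = initNvrtc(Saxpy) # `Saxpy` now comes from the `cuda:` macro
nvrtc.compile()
nvrtc.getPtx()
nvrtc.numBlocks = 32
nvrtc.threadsPerBlock = 128
# `hOut`, `a`, `hX`, `hY`, `n` set up as in the earlier example
nvrtc.execute("saxpy", (hOut), (a, hX, hY, n))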

Due to the somewhat restricted nature of CUDA code anyway, the vast
majority of practical code is already supported. You're unlikely to
reach for a CUDA device for complex string handling or ref object
madness in the first place. The features you'd expect to see all work:
we can access arrays, we have saner types (UncheckedArray instead of
raw pointers), we can access the CUDA special block / thread related
variables, etc. The latter is implemented by defining dummy types in
runtime_compile.nim, which are only there to keep the Nim compiler
happy during the typed macro pass. Typical CUDA annotations like
__global__ are mapped to Nim pragmas, as you can see above.

Important Nim features that are currently not supported:

  • generics
  • implicit result variable
  • built-in Nim types seq, string, etc.
  • on device openArray
  • while loops (should be simple to add)
  • case statements (should be straightforward, but likely not very
    useful)
  • echo on device (but you can printf, see below!)
  • we'll likely want a staticFor. Constantine's is currently slightly
    broken in the macro.
  • …?

Important Nim features that do work:

  • if statements
  • for loops
  • basic Nim objects
  • passing seq[T] (for T being value types!) to a kernel (technically
    a feature of execCuda) and using seq[T] as a return type
  • you can define and use templates in the cuda macro.
    • if you need access to constants defined outside of the cuda
      macro, you can create a template with a static body accessing
      the constant. The template body will be replaced by the constant
      value (see the sketch after this list)
    • macros should work fine, too.
  • native Nim types are mapped to CUDA types (no need to use cfloat,
    cint etc)
  • you can use when statements to avoid a runtime branch
  • most standard features (array indexing, member access, address of,
    dereferencing, casting, object constructors, …)
    • caveat for object constructors: In CUDA you cannot assign a
      statically sized array from a runtime value (or C / C++ for that
      matter). So BigInt(limbs: someArray) is invalid. You'll need
      to memcpy / manually assign data. Statically known arrays work
      though.
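
A minimal sketch of the 'template with a static body' trick mentioned
above (names are illustrative and the exact spelling may differ from
the BigInt example in this PR):

# A constant defined in regular Nim code, outside of the `cuda:` macro
const Scale = 3'u32

const ScaleKernel = cuda:
  # The template's static body is evaluated at Nim compile time and the
  # resulting value is inlined into the generated CUDA code.
  template scale(): untyped =
    static: Scale

  proc scaleBy(res, x: ptr UncheckedArray[uint32], n: csize_t) {.global.} =
    let tid = blockIdx.x * blockDim.x + threadIdx.x
    if tid < n:
      res[tid] = x[tid] * scale()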

Important CUDA features currently not supported:

  • __shared__ memory (we just need to implement the pragma)
  • __syncthreads and similar functions (also just need a Nim name
    for them and then map them to their CUDA name)
  • …?

Important CUDA features that do work:

  • access to blockIdx, blockDim, threadIdx
  • __global__, __device__, __forceinline__ pragmas (via the
    equivalent Nim pragmas without the underscores)
  • inline PTX via Nim's asm statement
  • you can printf on device (obviously only use this for debugging;
    see the sketch after this list)
  • memcpy
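
As an example of device-side printf (the kernel below is purely
illustrative):

const DebugKernel = cuda:
  proc whereAmI() {.global.} =
    # printf works on device; only use this for debugging
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x)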

Notes on the cuda design

Initially I started out with an untyped macro and thought I'd just
have the Nim code be only one layer above being a pure string literal.
Essentially just mapping Nim constructs directly to fixed strings. But I
quickly realized that having a typed macro would be much better,
because we could actually access type information and use templates in
the body (as they are expanded before the typed macro is executed!).

I think it is likely possible to go one step further than the
current code and access Nim procs defined outside the cuda
macro, as long as they are in scope (and not overloaded!). With a
typed macro we can get their bodies, insert them into the CUDA context
and treat them as __device__ functions.

My main motivation for this is not really sharing lots of code
between the CUDA target and other targets. While such code sharing
could theoretically be quite beneficial, I don't think it will be very
practical: different targets most likely require very different
approaches in many details, e.g. the low level primitives using inline
PTX instructions. At a higher level, one needs different approaches
due to the trade-offs required for efficient parallelism on Nvidia
GPUs compared to a CPU.

However, what I do think would be very useful is to be able to split the
cuda macro into multiple pieces (similar to how one writes Nim macros
really). Say one cuda call for type definitions, one for some device
functions etc. But due to the typed nature, this implies all the defined
types and functions would need to be visible in a global scope, which
currently would not be the case.

Profiling Nvidia code

Although it is probably obvious, it is worth mentioning that you can
of course use an Nvidia profiler (nvprof or ncu) on Nim binaries that
use this feature.

A more complex example

For a more complex example, see the BigInt example file part of this PR.
There we implement modular addition for finite field elements, similar
to the current existing implementation for the LLVM target (using inline
PTX instructions).

It shows how one defines a type on the CUDA device, accesses a constant
from Constantine (the field modulus) using a template with a static
body, how to construct objects on device and more. You'll see that the
code essentially looks like normal host code.
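
To give a flavour (this is not the actual example from the PR, which
uses inline PTX for the carry chain and performs the modular
reduction), a plain big-integer addition kernel could look roughly
like this:

const AddKernel = cuda:
  type
    BigInt = object
      limbs: array[8, uint32] # 8 uint32 limbs, as in the benchmark below

  proc addBigInts(res, a, b: ptr UncheckedArray[BigInt], n: csize_t) {.global.} =
    let tid = blockIdx.x * blockDim.x + threadIdx.x
    if tid < n:
      var r: BigInt
      var carry = 0'u64
      for i in 0 ..< 8:
        let s = uint64(a[tid].limbs[i]) + uint64(b[tid].limbs[i]) + carry
        r.limbs[i] = uint32(s and 0xFFFFFFFF'u64)
        carry = s shr 32
      res[tid] = r # modular reduction omitted in this sketch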

Be aware, of course, that to actually achieve really high performance,
just launching lots of blocks with many threads won't give you an
O(1-10k) speedup (depending on the # of CUDA cores) over a single
thread. You'll first need to go down the rabbit hole of thinking about
memory coalescence, blocks, warps and all that… As an example, a simple
benchmark performing additions of 2^25 pairs of finite field elements
of BN254_Snarks, using a BigInt type that stores 8 uint32 limbs, leads
to only a ~10x speedup compared to a single CPU core (using our very
optimized CPU code of course). nvprof shows that the memory performance
in that case is only 12.5%, because each thread has to jump over the 8
limbs of the neighboring threads/lanes. This leads to non-coalesced
memory access and causes a massive performance penalty. I mention this
in particular because implementing a structure of arrays (SoA) layout
for the data (where we have a single BigInts type, which has one array
for limb 0, one for limb 1 and so on) is currently not supported in the
context of copying data to the device via execCuda. We need to extend
the "when and what to copy" logic in the macro first.
