Nvidia target via NVRTC and Nim ↦ CUDA DSL #487
(Note: this is a draft, because it is a) a proof of concept and b) still depends on nimcuda for simplicity)

Table of contents & introduction
- CUDA execution
- NVRTC compiler helper
- Important note about CUDA library
- CUDA code generator
- Notes on the cuda design
- Profiling Nvidia code
- A more complex example

This (draft) PR adds an experimental alternative to generating code targeting Nvidia GPUs. Instead of relying on LLVM to generate Nvidia PTX instructions, this PR adds 3 pieces:
- an `execCuda` helper to execute compiled CUDA kernels, an extended version of the one defined in `codegen_nvidia.nim`: https://github.com/mratsim/constantine/blob/master/constantine/math_compiler/codegen_nvidia.nim (if we decide to go ahead with this PR, I'll merge the two. They are compatible, the new one just has a few extra features),
- a helper to interact with the NVRTC (runtime compilation library) compiler and compile a string of CUDA code,
- a `cuda` macro that generates a string of CUDA code from Nim code (the CUDA code generator).

A few words on each of these first:
CUDA execution
Starting with this, as it is already present in Constantine. Once one has a compiled CUDA kernel and wishes to execute it, in principle one needs to:
- copy all arguments to the device that are not pure value types
- launch the kernel via `cuLaunchKernel`, making sure to pass all parameters as an array of pointers
- copy the results back from the device afterwards
Instead of having to do this manually, we use a typed macro, which
determines the required action based on the parameters passed to it.
The basic usage looks like:
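A minimal sketch of such a call could look as follows (this is not the PR's literal example; the kernel handle argument, its position and the variable names are made up, only the `res` / `inputs` keywords are taken from the description below):

```nim
var r = newSeq[float32](1024)   # output: copied to the device and back
let x = newSeq[float32](1024)   # input: not a pure value type, so copied over
let a = 2.5'f32                 # input: plain value type, passed by value

execCuda(kernel,                # `kernel`: some previously compiled CUDA kernel
         res    = (r,),         # tuple of output parameters
         inputs = (a, x))       # tuple of input parameters
```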
where `res` and `inputs` are tuples (to support heterogeneous types).

Arguments passed as `res` are treated as output parameters. They will both be copied to the device and afterwards back to the local identifiers. `inputs` will either be passed by value or copied, depending on whether the data is a `ref` type or not.

NOTE: A currently not implemented feature is deep copying data structures which themselves contain references / pointers. This is important in particular if one wishes to pass data as a struct of arrays (SoA).
In practice, in the context of the code of this PR, you don't directly interact with `execCuda`. This is done via the NVRTC compiler in the next section.

NOTE: The parameters will be passed in the order:
- the `res` tuple elements in their tuple order
- the `inputs` tuple elements in their tuple order

This means that currently your output arguments must be the first arguments of the kernel!
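To illustrate the ordering rule (variable names again made up):

```nim
# With a call like this ...
execCuda(kernel, res = (r1, r2), inputs = (a, x))
# ... the kernel must declare its parameters in exactly this order:
#   kernel(r1, r2, a, x)   # outputs first, then inputs
```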
NVRTC compiler helper
This is essentially an equivalent of the LLVM based `NvidiaAssembler`, part of the LLVM backend:
https://github.com/mratsim/constantine/blob/master/constantine/math_compiler/codegen_nvidia.nim#L501-L512

As with all CUDA work, lots of boilerplate code is required to initialize the device, set up the compilation pipeline, call the compiler on the CUDA string, etc. As most of this is identical in the majority of use cases, we can automate it away.

NOTE: We will likely want to eventually add some context or config object to store e.g. specific parameters to pass to the NVRTC compiler.
As an example, let's look at what the Saxpy example from the CUDA documentation looks like for us now.
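Roughly, the flow looks like the following sketch. Only the step names `compile`, `getPtx` and `execute` are taken from the text below; the helper's constructor, its name and the exact signatures are placeholders, not the PR's actual API:

```nim
let saxpySrc = """
extern "C" __global__ void saxpy(float* res, float a,
                                 float* x, float* y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) res[i] = a * x[i] + y[i];
}
"""

var helper = initNvrtcHelper()    # placeholder: device init + NVRTC setup
helper.compile(saxpySrc)          # compile the CUDA string via NVRTC
let ptx = helper.getPtx()         # retrieve the generated PTX

# The majority of the remaining code is just setting up the kernel inputs.
const N = 1 shl 20
var res = newSeq[float32](N)
let x   = newSeq[float32](N)
let y   = newSeq[float32](N)
helper.execute("saxpy", res = (res,), inputs = (2.5'f32, x, y, N.cint))
```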
Clearly, most of the steps (`compile`, `getPtx`) could also just be done as part of the `execute` call. I just haven't merged them yet.

We can see that the majority of the code is now setting up the input data for the kernel.
Important note about CUDA library
To fully support all CUDA features using NVRTC, we need to use the `header` pragma in the CUDA wrapper. See this nimcuda issue about the problem: SciNim/nimcuda#27

(Note: the currently existing CUDA wrapper in Constantine also avoids the `header` pragma. Once we switch to using our own, we'll have to make that change and thus need the code below.)

This implies that we need to know the path to the CUDA libraries at compile time. Given that most people on Linux systems tend to install CUDA outside their package manager, this implies we need to pass the path to the compiler.
The `runtime_compile.nim` file contains the corresponding path variables. You can compile a program using `-d:CudaPath=<path/to/your/cuda>` to set the paths accordingly.
CUDA code generator
This brings us to the most interesting part of this PR. In the example
above we simply had a string of raw CUDA code. But for anyone who tends
to write Nim, this is likely not the most attractive nor elegant
solution. So instead for the Saxpy example from above, we can write:
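The following is only a sketch of what this can look like; the exact pragma spellings, the parameter types and the type of the built-in thread/block variables are my assumptions based on the notes further down:

```nim
cuda:
  proc saxpy(res: ptr UncheckedArray[cfloat],   # output first (see execCuda note)
             a: cfloat,
             x: ptr UncheckedArray[cfloat],
             y: ptr UncheckedArray[cfloat],
             n: cint) {.global.} =               # maps to `__global__`
    let i = cint(blockIdx.x * blockDim.x + threadIdx.x)
    if i < n:
      res[i] = a * x[i] + y[i]
```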
Due to the inherently somewhat restricted nature of writing CUDA code, the vast majority of practical code is already supported. You likely won't think of a CUDA device as your first choice for complex string handling or ref object madness. Note that the features you'd expect to see all work: we can access arrays, we have more sane types (`UncheckedArray` instead of raw pointers), we can access the special CUDA block / thread related variables, etc. The latter is implemented by defining dummy types in `runtime_compile.nim`, which are only there to keep the Nim compiler happy during the typed macro pass. Also, typical CUDA annotations like `__global__` are mapped to Nim pragmas, as you can see.
Important Nim features that are currently not supported:
- the `result` variable
- `seq`, `string` etc.
- `openArray`
- `while` loops
- (simple) `case` statements (should be straightforward, but likely not very useful)
- `echo` on device (but you can `printf`, see below!)
- `staticFor`. Constantine's is currently slightly broken in the macro.
Important Nim features that do work:
- `if` statements
- `for` loops
- passing a `seq[T]` (for `T` being value types!) to a kernel (technically a feature of `execCuda`) and using `seq[T]` as a return type
- `templates` in the `cuda` macro. To access a constant defined outside the `cuda` macro, you can create a template with a `static` body accessing the constant. The template body will be replaced by the constant value (see the sketch after this list)
- C types (`cfloat`, `cint` etc.)
- `when` statements to avoid a runtime branch
- general expressions (dereferencing, casting, object constructors, …)
- Note that you cannot construct a statically sized array from a runtime value in Nim (or C / C++ for that matter). So `BigInt(limbs: someArray)` is invalid. You'll need to `memcpy` / manually assign data. Statically known arrays work though.
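As a hedged illustration of the constant trick mentioned above (the names `Modulus` and `getModulus` are invented for this sketch):

```nim
const Modulus = [1'u32, 2, 3, 4]     # some compile-time constant on the host

cuda:
  template getModulus(): untyped =
    static: Modulus                  # body is replaced by the constant's value

  proc useModulus(res: ptr UncheckedArray[uint32]) {.global.} =
    let m = getModulus()             # `m` now holds the array above on device
    res[0] = m[0]
```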
Important CUDA features currently not supported:
- `__shared__` memory (just needs to implement the pragma)
- `__syncthreads` and similar functions (also just need a Nim name for them and then map them to their CUDA name)
Important CUDA features that do work:
- `blockIdx`, `blockDim`, `threadIdx`
- `__global__`, `__device__`, `__forceinline__` pragmas (via equivalent Nim pragmas without the `_`; see the sketch below)
- the `asm` statement
- `printf` on device (obviously only use this to debug)
- `memcpy`
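A small sketch of how the pragma mapping reads in practice (names and exact spellings are my assumption based on the "without the `_`" rule above):

```nim
cuda:
  proc mul2(x: cfloat): cfloat {.device, forceinline.} =      # `__device__ __forceinline__`
    return 2'f32 * x

  proc scaleKernel(res: ptr UncheckedArray[cfloat],
                   x: ptr UncheckedArray[cfloat]) {.global.} = # `__global__`
    let i = cint(blockIdx.x * blockDim.x + threadIdx.x)
    res[i] = mul2(x[i])
    printf("res[%d] = %f\n", i, res[i])   # device-side printf, debugging only
```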
Notes on the `cuda` design

Initially I started out with an `untyped` macro and thought I'd just have the Nim code be only one layer above being a pure string literal, essentially just mapping Nim constructs directly to fixed strings. But I quickly realized that having a `typed` macro would be much better, because we could actually access type information and use templates in the body (as they are expanded before the typed macro is executed!).
I think it is likely possible to go one step further than the current code and access Nim procs defined outside the `cuda` macro, as long as they are in scope (and not overloaded!). With a typed macro we can get their bodies, insert them into the CUDA context and treat them as `__device__` functions.

I'm mainly interested in this not so much for the purpose of sharing lots of code between the CUDA target and other targets. While such code sharing could theoretically be quite beneficial, I think it likely won't be very practical. Most likely different targets require a very different approach in many details, e.g. the low level primitives using inline PTX instructions. At a higher level one will need different approaches due to the trade-offs needed for efficient parallelism on Nvidia GPUs compared to a CPU approach.
However, what I do think would be very useful is being able to split the `cuda` macro into multiple pieces (similar to how one writes Nim macros really), say one `cuda` call for type definitions, one for some device functions, etc. But due to the typed nature, this implies all the defined types and functions would need to be visible in a global scope, which currently would not be the case.
Profiling Nvidia code
Although it is probably obvious, it is worth mentioning that you can of course use an Nvidia profiler (`nvprof` or `ncu`) on Nim binaries that use this feature.
A more complex example
For a more complex example, see the BigInt example file that is part of this PR. There we implement modular addition for finite field elements, similar to the currently existing implementation for the LLVM target (using inline PTX instructions).

It shows how one defines a type on the CUDA device, accesses a constant from Constantine (the field modulus) using a template with a `static` body, how to construct objects on device and more. You'll see that the code essentially looks like normal host code.
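To give a flavor, a very rough sketch of the shape of such code (this is not the code from the example file; the limb count, names and the omitted reduction step are placeholders):

```nim
cuda:
  type
    BigInt = object
      limbs: array[8, uint32]          # 8 × 32-bit limbs, as in the benchmark below

  proc add(r: var BigInt, a, b: BigInt) {.device.} =
    var carry = 0'u64
    for i in 0 ..< 8:
      let s = uint64(a.limbs[i]) + uint64(b.limbs[i]) + carry
      r.limbs[i] = uint32(s and 0xFFFF_FFFF'u64)  # low 32 bits
      carry = s shr 32                            # propagate the carry
    # conditional subtraction of the modulus (obtained via a `static`
    # template as described above) is omitted here
```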
Be aware, of course, that to actually achieve really high performance, just launching lots of blocks with many threads won't give you an O(1-10k) speedup (depending on the # of CUDA cores) over a single thread. You'll need to make sure to first go down the rabbit hole of thinking about memory coalescing, blocks, warps and all that…

As an example, a simple benchmark performing additions of 2^25 pairs of finite field elements of `BN254_Snarks`, using a `BigInt` type which stores 8 `uint32` limbs, leads to only a 10x speedup compared to a single CPU core (using our very optimized CPU code of course). `nvprof` shows that the memory performance in that case is only 12.5%, because each thread has to jump over the 8 limbs of the neighboring threads/lanes. This leads to non-coalesced memory access and causes a massive performance penalty. I mention this in particular because implementing a structure of arrays (SoA) approach for the data (where we have a single `BigInts` type, which has one array for limb 0, one for limb 1 and so on) is currently not supported in the context of copying data to the device via `execCuda`. We need to extend the "when and what to copy" logic in the macro first.