Naive Cuda (tagged for archival purposes)
Pre-release
Cuda FFI and a naive, not particularly functional Cuda backend, where a "parallel" axis is mapped across blocks and a "minibatch" axis is mapped across threads within a block.
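For illustration, a kernel under this mapping would look roughly like the sketch below; the kernel name, array names, and the per-cell update are made up for the example, not what the backend actually emits.

```cuda
// Hypothetical sketch of the naive mapping: the "parallel" axis is indexed by
// blockIdx.x and the "minibatch" axis by threadIdx.x. Names, shapes and the
// update are illustrative only.
extern "C" __global__ void naive_update(float *dst, const float *src,
                                        int parallel_dim, int minibatch_dim) {
  int p = blockIdx.x;   // one block per "parallel" slice
  int m = threadIdx.x;  // one thread per "minibatch" element
  if (p < parallel_dim && m < minibatch_dim) {
    int i = p * minibatch_dim + m;
    dst[i] += src[i];   // placeholder for the real per-cell computation
  }
}
```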
This approach does not really work because it lacks synchronization across blocks. Also, the "parallel axis" / "minibatch axis" approach is not really usable (for either the Cuda or the Gccjit backend).
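Concretely, `__syncthreads()` only provides a block-local barrier, and a single standard kernel launch has no grid-wide barrier, so anything that reduces across the "parallel" (block) axis has to be split over kernel launches, with the launch boundary acting as the synchronization point. The sketch below shows that standard two-pass pattern; it is illustrative only, not something this release implements.

```cuda
// Two-pass reduction across the "parallel" (block) axis: pass 1 produces one
// partial sum per block; pass 2 runs after the launch boundary, which is the
// only grid-wide synchronization available here. Illustrative sketch only.
#include <cstdio>

__global__ void partial_sums(const float *src, float *block_sums, int minibatch_dim) {
  extern __shared__ float buf[];
  int m = threadIdx.x;
  buf[m] = src[blockIdx.x * minibatch_dim + m];
  __syncthreads();                                   // block-local barrier only
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (m < stride) buf[m] += buf[m + stride];
    __syncthreads();
  }
  if (m == 0) block_sums[blockIdx.x] = buf[0];
  // No way to wait here for the other blocks' results within this launch.
}

__global__ void final_sum(const float *block_sums, float *out, int parallel_dim) {
  float s = 0.f;
  for (int p = 0; p < parallel_dim; ++p) s += block_sums[p];
  *out = s;
}

int main() {
  const int parallel_dim = 4, minibatch_dim = 256;   // power-of-two block size
  float *src, *block_sums, *out;
  cudaMallocManaged(&src, parallel_dim * minibatch_dim * sizeof(float));
  cudaMallocManaged(&block_sums, parallel_dim * sizeof(float));
  cudaMallocManaged(&out, sizeof(float));
  for (int i = 0; i < parallel_dim * minibatch_dim; ++i) src[i] = 1.f;
  partial_sums<<<parallel_dim, minibatch_dim, minibatch_dim * sizeof(float)>>>(
      src, block_sums, minibatch_dim);
  final_sum<<<1, 1>>>(block_sums, out, parallel_dim); // second launch = grid-wide sync
  cudaDeviceSynchronize();
  printf("sum = %f (expected %d)\n", *out, parallel_dim * minibatch_dim);
  cudaFree(src); cudaFree(block_sums); cudaFree(out);
  return 0;
}
```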
With too many total threads, Cuda hangs or takes excessively long compiling to PTX. In the cases where the Cuda backend does work, the Gccjit backend is far faster.
Other meaningful improvements include low-level code optimizations and simplifications, and refactorings.