Naive Cuda (tagged for archival purposes)
Pre-release
Cuda FFI and a naive, not particularly functional Cuda backend, where a "parallel" axis is mapped across blocks and a "minibatch" axis is mapped across threads within a block.
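For illustration, a kernel under this mapping would look roughly like the sketch below; the kernel name, array names, and the per-cell update are made up for the example, not what the backend actually emits.

```cuda
// Hypothetical sketch of the naive mapping: the "parallel" axis is indexed by
// blockIdx.x and the "minibatch" axis by threadIdx.x. Names, shapes and the
// update are illustrative only.
extern "C" __global__ void naive_update(float *dst, const float *src,
                                        int parallel_dim, int minibatch_dim) {
  int p = blockIdx.x;   // one block per "parallel" slice
  int m = threadIdx.x;  // one thread per "minibatch" element
  if (p < parallel_dim && m < minibatch_dim) {
    int i = p * minibatch_dim + m;
    dst[i] += src[i];   // placeholder for the real per-cell computation
  }
}
```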
This approach does not really work because it lacks synchronization across blocks. Also, the "parallel axis" / "minibatch axis" approach is not really usable (for either the Cuda or the Gccjit backend).
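Concretely, `__syncthreads()` only provides a block-local barrier, and a single standard kernel launch has no grid-wide barrier, so anything that reduces across the "parallel" (block) axis has to be split over kernel launches, with the launch boundary acting as the synchronization point. The sketch below shows that standard two-pass pattern; it is illustrative only, not something this release implements.

```cuda
// Two-pass reduction across the "parallel" (block) axis: pass 1 produces one
// partial sum per block; pass 2 runs after the launch boundary, which is the
// only grid-wide synchronization available here. Illustrative sketch only.
#include <cstdio>

__global__ void partial_sums(const float *src, float *block_sums, int minibatch_dim) {
  extern __shared__ float buf[];
  int m = threadIdx.x;
  buf[m] = src[blockIdx.x * minibatch_dim + m];
  __syncthreads();                                   // block-local barrier only
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (m < stride) buf[m] += buf[m + stride];
    __syncthreads();
  }
  if (m == 0) block_sums[blockIdx.x] = buf[0];
  // No way to wait here for the other blocks' results within this launch.
}

__global__ void final_sum(const float *block_sums, float *out, int parallel_dim) {
  float s = 0.f;
  for (int p = 0; p < parallel_dim; ++p) s += block_sums[p];
  *out = s;
}

int main() {
  const int parallel_dim = 4, minibatch_dim = 256;   // power-of-two block size
  float *src, *block_sums, *out;
  cudaMallocManaged(&src, parallel_dim * minibatch_dim * sizeof(float));
  cudaMallocManaged(&block_sums, parallel_dim * sizeof(float));
  cudaMallocManaged(&out, sizeof(float));
  for (int i = 0; i < parallel_dim * minibatch_dim; ++i) src[i] = 1.f;
  partial_sums<<<parallel_dim, minibatch_dim, minibatch_dim * sizeof(float)>>>(
      src, block_sums, minibatch_dim);
  final_sum<<<1, 1>>>(block_sums, out, parallel_dim); // second launch = grid-wide sync
  cudaDeviceSynchronize();
  printf("sum = %f (expected %d)\n", *out, parallel_dim * minibatch_dim);
  cudaFree(src); cudaFree(block_sums); cudaFree(out);
  return 0;
}
```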
With too many total threads, Cuda hangs or takes excessively long compiling to PTX. In the cases where the Cuda backend does work, the Gccjit backend is far faster.
Other meaningful improvements include low-level code optimizations and simplifications, and refactorings.