Release Half precision, mixed precision, CUDA virtual devices · ahrefs/ocannl

Release 0.4.1 offers half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.

From the CHANGELOG:

Implemented the previously-mocked support for half precision (FP16).
- We work around the missing Ctypes coverage by not using Ctypes.bigarray_start.
- We check FP16 constants for overflow.
- We output half precision specific code from the CUDA backend.
Finally proper support for mixed precision! Lazy precision defaults and delayed precision setting via Tnode.update_prec.
A placeholder nn_blocks.ml hinting at an intended design pattern for model components.
A memory model for the multiple virtual devices per physical device setup, implemented in the CUDA backend. It fixes the CUDA backend behavior in the data parallelism benchmark.
Slides for the Fun OCaml meetup: docs/Fun OCaml.
New syntax: inline tensor declarations with a literal float as initial value.
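
As an illustration of the FP16 constant overflow check mentioned above, here is a minimal sketch. It is not OCANNL's actual code: `fp16_max` and `check_fp16_constant` are hypothetical names. The only fact it relies on is that the largest finite IEEE 754 half-precision value is 65504. The threshold is slightly conservative, since values just above 65504 would still round down to it.

```ocaml
(* Largest finite IEEE 754 binary16 value: (2 - 2^-10) * 2^15 = 65504. *)
let fp16_max = 65504.0

(* Hypothetical helper, not part of OCANNL's API: returns Error for a
   constant whose magnitude exceeds the FP16 range (infinities included);
   NaN is passed through unchanged. *)
let check_fp16_constant (c : float) : (float, string) result =
  if Float.is_nan c then Ok c
  else if Float.abs c > fp16_max then
    Error (Printf.sprintf "constant %g overflows FP16 (max %g)" c fp16_max)
  else Ok c
```

A caller could run this on every literal before emitting half precision code and report the `Error` message instead of silently producing an infinity.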

Removed the pipes_cc and pipes_gccjit backends (Pipes_multicore_backend). I had fixed Pipes_multicore_backend by using the poll library instead of Unix.select, but it turned out to be very slow.
Changed the %cd block comment syntax ~~ to allow detailed structuring. Rewrote Train.grad_update to use the %cd syntax.
Made Train.sgd_one slightly more thrifty: p =- learning_rate *. sgd_delta --> p =- learning_rate * sgd_delta ~logic:"." without the inline tensor expression.

Log levels related de-confusion:
- Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
- Properly restore log_level and inform about its setting.
- By default do not log from tests.
- debug_log_from_routines should only happen when log_level > 1.
Bugs in Multicore_backend:
- await was not checking queue emptiness.
- The worker's Condition.broadcast was non-atomically guarded (doesn't need to be).
- Possible deadloop due to the lockfree queue, now replaced with saturn_lockfree.
Reduced busy-waiting inside c_compile_and_load. Compilation errors are now propagated instead of causing an infinite loop.
Fixed loss of significant digits for small numbers when outputting files.
Added missing mixed-precision conversions in the C_syntax backend builder.
Restored the functionality of debug logging from the CUDA backend.
Always reinitialize global state at the beginning of let%expect_test, to make them more deterministic.
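
The changelog does not say how the loss of significant digits was fixed. As a sketch of this class of problem only, the snippet below contrasts fixed-point formatting, which truncates small numbers to zero, with a format that keeps enough significant digits to round-trip a double. The function names `lossy` and `faithful` are hypothetical.

```ocaml
(* Fixed-point formatting with the default 6 decimals: a value such as
   1.5e-9 is printed as zero, so all its significant digits are lost. *)
let lossy (x : float) : string = Printf.sprintf "%f" x

(* 17 significant digits are enough to round-trip any finite double
   through its decimal representation. *)
let faithful (x : float) : string = Printf.sprintf "%.17g" x
```

Reading the `faithful` output back with `float_of_string` recovers the original value, which is the property wanted when writing numbers to files.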
