Prepare release v0.2.0
lukstafi committed Jun 3, 2023
1 parent c2a2073 commit 60ee882
Showing 4 changed files with 41 additions and 5 deletions.
29 changes: 29 additions & 0 deletions CHANGES.md
@@ -1,3 +1,32 @@
## [0.2.0] -- 2023-06-03

### Added

- The Gccjit backend operates using "on device" copies of tensors, where the "device memory" is the stack of the C function. This is intended to improve cache locality and reduce cache contention.
- Four synchronization heuristics:
- "parallel": a slice of the tensor is copied host-to-device at the beginning and device-to-host at the end, without interference because each task has a different slice.
- "update on host": the tensor is copied host-to-device at the beginning; each write is an update that reads the old value from the host and applies the result on the host, so every write is a synchronization point.
- "replicated": the tensor is copied host-to-device at the beginning; only task 0 copies device-to-host.
- "device-only": no copying to/from host.
- On-device-only tensors that are not materialized on the OCaml side.
- A new category of axis dimensions is introduced: `Frozen`. It is analogous to the `Parallel` axis category in that a single task execution / "device call" only processes a 1D slice of the axis.
- Currently, for tensors processed in parallel, we only support processing a contiguous tensor slice (copied "to device" using `memcpy`).
- A new syntax `%nn_rs` ("postprocess results" variant of `%nn_dt`) for computations that should happen at the end of task execution / refresh step. It's meant to prepare the data to be copied back to the host.
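
The "parallel" heuristic above can be sketched as follows. This is a hypothetical illustration, not code emitted by the Gccjit backend; all names (`task_body`, `host_tensor`, the sizes) are invented. Each task copies its own contiguous slice onto the C function stack (the "device memory"), computes there, and copies the slice back; since the slices are disjoint, the copies do not interfere.

```c
#include <string.h>

#define NUM_TASKS 4
#define SLICE_LEN 256

/* Host-side storage for the whole tensor. */
float host_tensor[NUM_TASKS * SLICE_LEN];

/* Body of one task under the "parallel" heuristic. The stack array
   plays the role of "device memory" (cf. the Gccjit backend note). */
void task_body(int task_id) {
  float device_slice[SLICE_LEN];
  float *host_slice = host_tensor + task_id * SLICE_LEN;
  /* Host-to-device copy of this task's contiguous slice: */
  memcpy(device_slice, host_slice, sizeof device_slice);
  for (int i = 0; i < SLICE_LEN; ++i)
    device_slice[i] += 1.0f;  /* stand-in for the task's computation */
  /* Device-to-host copy at the end; slices are disjoint, so no races: */
  memcpy(host_slice, device_slice, sizeof device_slice);
}
```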

### Changed

- Removed backend-agnostic synchronization; it was not worth the complexity and implementation effort at this point.
- The `Rebalance` constructor is kept around, but it no longer plays any role.
- Removed `debug_virtual_nodes`; it was tricky to maintain.
- Dynamic indexing now skips over parallel axes: when there is a `Parallel` axis on the left, it is preserved in the resulting tensor (slice), and the next-right axis is indexed into instead.
- Removed the "indexing axes from the right" functionality for now (it fails as not implemented).
- Dynamic indexing can now produce virtual nodes.
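
The parallel-axis skipping rule above can be illustrated on the shape level. This is a hypothetical sketch (OCANNL's actual shape representation differs): given a shape `[Parallel 4; 3; 5]`, one dynamic-indexing step preserves the leading `Parallel` axis and indexes away the next axis to the right, producing a slice of shape `[Parallel 4; 5]`.

```c
#include <assert.h>
#include <string.h>

enum axis_kind { DIM, PARALLEL };
struct axis { enum axis_kind kind; int size; };

/* Shape of the result of one dynamic-indexing step: leading Parallel
   axes are preserved, and the next axis to the right is indexed away.
   Writes the result shape into `out` and returns its rank. */
int dyn_index_shape(const struct axis *shape, int rank, struct axis *out) {
  int i = 0;
  while (i < rank && shape[i].kind == PARALLEL) {
    out[i] = shape[i];  /* preserve the Parallel axis in the slice */
    i++;
  }
  assert(i < rank);  /* there must be an axis left to index into */
  memcpy(out + i, shape + i + 1, (size_t)(rank - i - 1) * sizeof *shape);
  return rank - 1;
}
```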

### Fixed

- Dynamic indexing fixes.


## [0.1.2] -- 2023-05-12

### Added
13 changes: 10 additions & 3 deletions README.md
@@ -30,16 +30,23 @@ Warning disclaimer: this project is still "not announced". The features describe

## Future milestones

For past milestones see [CHANGES](CHANGES.md).

* Skipping v0.2 (already released; see below).
* **v0.3-GPU**: a CUDA backend.
* **v0.3.1-tiling**: the tiling optimization.
* **v0.4-usability**: examples covering most of Andrej Karpathy's "Neural Networks Zero to Hero" series; data loading; checkpointing.
* **v0.5-documentation**: `.mli` files and maybe more documentation.
* **v0.6-scale**: distributed computation; runtime-autotuning optimization settings.
* **v1-completeness**: any not-yet-implemented features that still seem needed and impact the framework design. (E.g., at the time of v0.1.X, convolutions, reshaping, and concatenation are not easily expressible.)

### Releases

For details, see [CHANGES](CHANGES.md).

* **v0.2**: for multicore CPU, improve cache locality and reduce cache contention by treating the C function stack as the "device memory".
* **v0.1.2**: multicore computations using a thread-local "task id" index.
* **v0.1.1**: inlining scalar constants, improved inlining for virtual nodes.
* **v0.1.0**: a `Gccjit` backend, single and double precision floats, code compiled as a monolithic update step function.


## Why not just use [OWL](https://ocaml.xyz/)?

OCANNL follows different design choices than [OWL](https://ocaml.xyz/). For example:
2 changes: 1 addition & 1 deletion dune-project
@@ -4,7 +4,7 @@

(name ocannl)

(version 0.1.2)
(version 0.2.0)

(generate_opam_files true)

2 changes: 1 addition & 1 deletion ocannl.opam
@@ -1,6 +1,6 @@
# This file is generated by dune, edit dune-project instead
opam-version: "2.0"
version: "0.1.2"
version: "0.2.0"
synopsis:
"A from-scratch Deep Learning library with CUDA, operator fusion, staged compilation, backprop"
description: "A longer description"
