From c323f1a83fe14373aadd189a9a92297686e076be Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe <andrewjradcliffe@gmail.com>
Date: Tue, 21 Nov 2023 18:50:01 -0800
Subject: [PATCH 1/5] Draft, prior to addition of Prior Art section

---
 designs/0000-cmdstanrs.md | 233 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 233 insertions(+)
 create mode 100644 designs/0000-cmdstanrs.md
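
A minimal sketch of the argument tree that the design document added below
describes under "Arguments and options", "Translation" and "Ergonomics".
All names here (`Method`, `Data`, `ArgumentTree`, `SampleBuilder`, `to_args`)
are hypothetical and trimmed to two methods and one data option; this is not
the cmdstan-rs API. The fallback value of 1000 for `num_samples` and
`num_warmup` is assumed to mirror CmdStan's defaults.

```rust
/// Sum type: exactly one inference method is chosen (hypothetical, trimmed).
#[derive(Debug, Clone, PartialEq)]
enum Method {
    Sample { num_samples: i32, num_warmup: i32 },
    Optimize { iter: i32 },
}

/// Product type: options that travel together.
#[derive(Debug, Clone, PartialEq, Default)]
struct Data {
    file: Option<String>,
}

#[derive(Debug, Clone, PartialEq)]
struct ArgumentTree {
    method: Method,
    data: Data,
}

/// Builder for the `Sample` variant; unset fields fall back to assumed defaults.
#[derive(Default)]
struct SampleBuilder {
    num_samples: Option<i32>,
    num_warmup: Option<i32>,
}

impl SampleBuilder {
    fn num_samples(mut self, n: i32) -> Self {
        self.num_samples = Some(n);
        self
    }
    fn num_warmup(mut self, n: i32) -> Self {
        self.num_warmup = Some(n);
        self
    }
    fn build(self) -> Method {
        Method::Sample {
            num_samples: self.num_samples.unwrap_or(1000),
            num_warmup: self.num_warmup.unwrap_or(1000),
        }
    }
}

impl Method {
    /// Emit the `sum` production: `type "=" variant` followed by its pairs.
    fn to_args(&self) -> Vec<String> {
        match self {
            Method::Sample { num_samples, num_warmup } => vec![
                "method=sample".to_string(),
                format!("num_samples={}", num_samples),
                format!("num_warmup={}", num_warmup),
            ],
            Method::Optimize { iter } => {
                vec!["method=optimize".to_string(), format!("iter={}", iter)]
            }
        }
    }
}

impl ArgumentTree {
    /// Fold the tree left to right into a linear argument list.
    fn to_args(&self) -> Vec<String> {
        let mut args = self.method.to_args();
        if let Some(file) = &self.data.file {
            // `product` production: the type name, then its key=value pairs.
            args.push("data".to_string());
            args.push(format!("file={}", file));
        }
        args
    }
}
```

Building a tree with `SampleBuilder::default().num_samples(500).build()` and
calling `to_args()` yields one string per command-line token, ready to hand to
`std::process::Command::args`. An inconsistent combination such as
`Method::Optimize { num_warmup: 10 }` is rejected by the compiler, which is the
property the design relies on.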

diff --git a/designs/0000-cmdstanrs.md b/designs/0000-cmdstanrs.md
new file mode 100644
index 0000000..1805bc7
--- /dev/null
+++ b/designs/0000-cmdstanrs.md
@@ -0,0 +1,233 @@
+- Feature Name: (fill me in with a unique ident, my_awesome_feature)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: (leave this empty)
+- Stan Issue: (leave this empty)
+
+# Summary
+[summary]: #summary
+
+This is a proposal for a Rust interface for CmdStan through compiled
+executables, that is, no direct interface with C++.
+
+The goal is to provide an interface which enables users to:
+- compile Stan programs (with arbitrary options)
+- build and compose arguments/options (to be passed to C++
+  executables) in idiomatic Rust
+- call C++ executables, then memoize input and collect output (thereby
+  making these available for programmatic use)
+- call `diagnose` and `stansummary` tools and collect output
+
+# Motivation
+[motivation]: #motivation
+
+Suppose that you write Rust code. Suppose that you use `Stan` for
+probabilistic programming. You have three choices for creating an
+application which utilizes both: shell scripts, a scripting-language
+dependency (e.g. Python), or a Rust interface for `CmdStan`.
+
+Orchestration using the shell suffers from portability issues and
+often leads to unnecessary fragmentation of control flow. Introducing a
+scripting language may confer portability, but control flow and error
+handling are now divided between two languages; furthermore, this
+necessitates code be written to serialize/deserialize intermediates.
+
+A Rust interface, in similar spirit to the CmdStan interfaces from
+other languages, would provide the necessary abstraction to eliminate
+the aforementioned problems.
+
+# Functional Specification
+
+Given a Stan program, a user of the library will compile the model (if
+desired), call the executable with arguments (translated from a
+strongly-typed argument tree), and obtain a self-contained context
+which encapsulates the pertinent information from the call.
+
+## Assumptions
+
+We assume (at our peril) that Rust programmers that will be able to
+figure out how to satisfy the following requirement:
+- a working CmdStan installation exists at some user-accessible path
+
+## Control: compilation and calling the resultant executable
+
+A `CmdStanModel` type will serve as an abstraction for a Stan program,
+which may need to be compiled. Rather than compile on construction, a
+compilation method must be explicitly called in user code (assuming
+that a satisfactory executable does not exist yet).
+
+Two arguments will be necessary to create a `CmdStanModel`:
+1. a path to a CmdStan installation
+2. a path to a Stan program
+
+Methods (receiver: `CmdStanModel` instance) exposed to the user will
+include:
+- `validate_cmdstan` : determine whether the CmdStan installation works
+- `executable_works` : is there a working executable at the path
+  implied by the Stan file?
+- `compile_with_args` : attempt to compile a Stan program with
+  optional `make` arguments
+- `call_executable` : call executable with the given argument tree; on
+  success, return a `CmdStanOutput` instance.
+
+## Output
+
+A successful `call_executable` call on a `CmdStanModel` produces a
+`CmdStanOutput` instance, which encapsulates the context of the call.
+This includes:
+- the console output (exit status, stdout and stderr)
+- the argument tree provided to `call_executable`
+- the current working directory of the process at the time the call was made
+- the CmdStan installation from the parent `CmdStanModel`
+
+The objective is for `CmdStanOutput` to be a self-contained record
+which includes all pertinent information from a successful executable
+call. This structure can then be used to direct calls of
+`diagnose`/`stansummary`. Naturally, methods on said type will be
+present to expose the context to the user and perform utility
+functions (e.g. return a list of output file paths).
+
+## Processes, IO
+
+The proposal is to use the Rust `std` library, in particular the
+[process](https://doc.rust-lang.org/std/process/index.html),
+[path](https://doc.rust-lang.org/std/path/index.html),
+[fs](https://doc.rust-lang.org/std/fs/index.html), and
+[ffi](https://doc.rust-lang.org/std/ffi/index.html) modules, to
+orchestrate processes, interact with file system and handle
+cross-platform concerns. This will yield a library which is portable,
+provided that it is (cross-)compiled for the intended target.
+
+## Arguments and options
+
+Stan provides several inference engines, each with a large number of
+options. CmdStan in turn handles this heterogeneity.
+
+To encapsulate the arguments passed at the command line (to a compiled
+executable), the proposal is a strongly-typed representation of this
+heterogeneity using a combination of sum types (Rust `enum`) and
+product types (Rust `struct`). By construction, this representation
+prevents the formation of inconsistent argument combinations -- the
+code simply won't compile. The resultant tree is an abstraction which
+enables the use of a single type (`CmdStanOutput`) to encapsulate a
+call to an executable.
+
+Unsurprisingly, the argument tree is a syntax tree for `CmdStan`
+command arguments. We translate to the very simple command line
+language, but leave open the possibility of translation to other
+languages.
+
+### Translation
+
+The (sloppy) productions for the command line language are:
+```text
+tree    -> terms
+terms   -> term " " term | term
+term    -> pair | product | sum
+pair    -> key "=" value
+product -> type " " pairs
+sum     -> type "=" variant " " terms | type "=" variant
+pairs   -> pairs " " pair | pair
+
+key     -> A
+A       -> A alpha | beta
+alpha   -> letter | digit | "_"
+beta    -> letter
+letter  -> "A" | ... | "z"
+digit   -> "0" | ... | "9"
+
+value   -> number | path
+```
+Where the productions for `number` and `path` are left out for brevity.
+The start symbol is `tree`. Generate the command line statement by
+folding the tree from left to right by generating the appropriate term
+from each node. 
+
+### Ergonomics
+
+The builder pattern will be implemented for each `struct`, and for
+each `enum` variant (excluding unit variants). This
+enables the user to supply only the arguments for which they desire
+non-default values. Philosophy: pay (in LOC) for only what you need.
+
+There is an incidental benefit (from my perspective) afforded by the
+strongly-typed representation: with
+[company-mode](https://github.com/company-mode/company-mode) and
+[eglot-mode](https://github.com/joaotavora/eglot)
+([lsp-mode](https://github.com/emacs-lsp/lsp-mode/) also works) in
+Emacs 28.2, one can view options at each node in the argument tree by
+writing code that looks something like the following:
+
+```rust
+ArgumentTree::builder(). // hover on the `.`
+```
+
+If one has a Rust language server and completion support in their
+editor, this is a free side effect. Whether it will help anyone
+is uncertain.
+
+### Coverage
+
+The objective is for the interface to cover all options which can be
+passed to a compiled Stan program, that is, all methods and all
+options for said methods.
+
+
+## Separation of concerns
+
+Other than the argument tree support, the interface proposed is very
+simple: the user can compile a Stan program, call it with arguments,
+get basic information from the output, and call
+`diagnose`/`stansummary`. Below, I provide the rationale for exclusion
+of two aspects.
+
+### Serialization
+
+It is trivial to provide a method such as 
+```rust
+fn `write_json<T: Serialize>(data: T, file: &Path) {}`
+```
+but it serves no purpose -- it does not enforce the conventions
+adopted for representing Stan data types in JSON (e.g. matrices
+represented as a vector of *row* vectors, not a vector of *column*
+vectors), hence, would likely lead to unexpected (and potentially
+silent!) errors.
+
+In order to develop a serializer which respects the conventions for
+Stan data types, one would need to declare conventions for the mapping
+of Rust data types to Stan data
+types. [serde_json](https://github.com/serde-rs/json) would be nice to
+use, but has some incompatibilities (Rust tuple is represented as an
+array, rather than an object).
+
+Moreover, to represent matrices, complex numbers, etc., one would need
+to support types from the crate ecosystem since the standard library
+lacks these -- [nalgebra](https://github.com/dimforge/nalgebra) and
+[num-complex](https://github.com/rust-num/num-complex) are reasonable
+choices, but nonetheless represent decisions to be made!
+
+From a design perspective, this is a great place to defer to the user,
+at least for the moment. A principled approach would involve writing a
+data format for [serde](https://serde.rs/)
+
+### Deserialization
+
+Parsing Stan CSVs to a strongly-typed representation is simple if one
+wishes to simply obtain a matrix of values (or `Vec<Vec<f64>>` if we
+limit ourselves to `std` library types). However, one needs to extract
+the variables from each line, thus, one needs to know the types (and
+their dimensions). A recursive definition of types using `enum`s and
+`struct`s could probably work to represent such a thing in Rust, but
+may not necessarily be particularly ergonomic (i.e. much unavoidable
+boilerplate would be needed to use the resultant type).
+
+Procedural macros, applied to a Stan program stored in a string
+literal in a Rust program, could be used to generate types and a
+parser for Stan CSVs produced by said program. However, in order to
+implement such an idea, one would first need to adopt conventions for
+representing Stan data types using Rust data types. This requires
+careful thought and is something best left for the future, if
+ever.
+
+The current proposal is for `CmdStanOutput` to be capable of returning
+paths to the files and the user parses them however they desire.
+This leaves open the possibility of multiple parsing strategies.

From 5d6456fec220136578f9eb05be648d77d5cad488 Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe <andrewjradcliffe@gmail.com>
Date: Tue, 21 Nov 2023 18:52:58 -0800
Subject: [PATCH 2/5] Wishful numbering

---
 designs/{0000-cmdstanrs.md => 0035-cmdstanrs.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename designs/{0000-cmdstanrs.md => 0035-cmdstanrs.md} (100%)

diff --git a/designs/0000-cmdstanrs.md b/designs/0035-cmdstanrs.md
similarity index 100%
rename from designs/0000-cmdstanrs.md
rename to designs/0035-cmdstanrs.md

From 1cbf5275cc86a83426cbf79962a97ff104bb1259 Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe <andrewjradcliffe@gmail.com>
Date: Tue, 21 Nov 2023 19:02:56 -0800
Subject: [PATCH 3/5] Some obvious touch-ups after viewing the rendered
 markdown

---
 designs/0035-cmdstanrs.md | 40 +++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)
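
A minimal sketch of the "Processes, IO" approach that this patch relocates,
using only `std`. `CallRecord` and the free function `call_executable` are
hypothetical stand-ins for the proposed `CmdStanOutput` and the
`CmdStanModel::call_executable` method; the fields follow the list in the
"Output" section, and how a non-zero exit status should be treated is left
open here.

```rust
use std::env;
use std::ffi::OsString;
use std::io;
use std::path::{Path, PathBuf};
use std::process::{Command, Output};

/// Hypothetical self-contained record of one executable call.
#[derive(Debug)]
struct CallRecord {
    /// Exit status, stdout and stderr of the child process.
    output: Output,
    /// The arguments exactly as they were passed to the executable.
    args: Vec<OsString>,
    /// Working directory of this process when the call was made.
    cwd: PathBuf,
    /// CmdStan installation of the parent model.
    cmdstan: PathBuf,
}

/// Run `exe` with `args` and capture the context of the call.
/// Returns `Err` if the working directory is unavailable or the
/// process cannot be spawned (e.g. the executable does not exist).
fn call_executable(exe: &Path, args: &[&str], cmdstan: &Path) -> io::Result<CallRecord> {
    let cwd = env::current_dir()?;
    let args: Vec<OsString> = args.iter().map(|a| OsString::from(*a)).collect();
    // `output()` waits for the child and collects stdout/stderr.
    let output = Command::new(exe).args(&args).output()?;
    Ok(CallRecord {
        output,
        args,
        cwd,
        cmdstan: cmdstan.to_path_buf(),
    })
}
```

Because paths and arguments are carried as `Path`/`OsString` rather than
`String`, the same code handles platform-specific path encodings without
special cases.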

diff --git a/designs/0035-cmdstanrs.md b/designs/0035-cmdstanrs.md
index 1805bc7..4e019b8 100644
--- a/designs/0035-cmdstanrs.md
+++ b/designs/0035-cmdstanrs.md
@@ -1,7 +1,7 @@
-- Feature Name: (fill me in with a unique ident, my_awesome_feature)
-- Start Date: (fill me in with today's date, YYYY-MM-DD)
-- RFC PR: (leave this empty)
-- Stan Issue: (leave this empty)
+- Feature Name: cmdstanrs
+- Start Date: 2023-11-22
+- RFC PR:
+- Stan Issue:
 
 # Summary
 [summary]: #summary
@@ -42,12 +42,23 @@ desired), call the executable with arguments (translated from
 strongly-typed argument tree), and obtain a self-contained context
 which encapsulates the pertinent information from the call.
 
-## Assumptions
+### Assumptions
 
 We assume (at our peril) that Rust programmers that will be able to
 figure out how to satisfy the following requirement:
 - a working CmdStan installation exists at some user-accessible path
 
+### Processes, IO
+
+The proposal is to use the Rust `std` library, in particular the
+[process](https://doc.rust-lang.org/std/process/index.html),
+[path](https://doc.rust-lang.org/std/path/index.html),
+[fs](https://doc.rust-lang.org/std/fs/index.html), and
+[ffi](https://doc.rust-lang.org/std/ffi/index.html) modules, to
+orchestrate processes, interact with the file system, and handle
+cross-platform concerns. This will yield a library which is portable,
+provided that it is (cross-)compiled for the intended target.
+
 ## Control: compilation and calling the resultant executable
 
 A `CmdStanModel` type will serve as an abstraction for a Stan program,
@@ -86,17 +97,6 @@ call. This structure can then be used to direct calls of
 present to expose the context to the user and perform utility
 functions (e.g. return a list of output file paths).
 
-## Processes, IO
-
-The proposal is to use the Rust `std` library, in particular the
-[process](https://doc.rust-lang.org/std/process/index.html),
-[path](https://doc.rust-lang.org/std/path/index.html),
-[fs](https://doc.rust-lang.org/std/fs/index.html), and
-[ffi](https://doc.rust-lang.org/std/ffi/index.html) modules, to
-orchestrate processes, interact with file system and handle
-cross-platform concerns. This will yield a library which is portable,
-provided that it is (cross-)compiled for the intended target.
-
 ## Arguments and options
 
 Stan provides several inference engines, each with a large number of
@@ -121,7 +121,7 @@ languages.
 The (sloppy) productions for the command line language are:
 ```text
 tree    -> terms
-terms   -> term " " term | term
+terms   -> terms " " term | term
 term    -> pair | product | sum
 pair    -> key "=" value
 product -> type " " pairs
@@ -182,9 +182,9 @@ of two aspects.
 
 ### Serialization
 
-It is trivial to provide a method such as 
+It is trivial to provide a function such as 
 ```rust
-fn `write_json<T: Serialize>(data: T, file: &Path) {}`
+fn write_json<T: Serialize>(data: &T, file: &Path) {}
 ```
 but it serves no purpose -- it does not enforce the conventions
 adopted for representing Stan data types in JSON (e.g. matrices
@@ -207,7 +207,7 @@ choices, but nonetheless represent decisions to be made!
 
 From a design perspective, this is a great place to defer to the user,
 at least for the moment. A principled approach would involve writing a
-data format for [serde](https://serde.rs/)
+data format for [serde](https://serde.rs/).
 
 ### Deserialization
 

From 2b5a086714b44898e5a3dd6ca738a990af1a8632 Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe <andrewjradcliffe@gmail.com>
Date: Wed, 22 Nov 2023 17:33:03 -0800
Subject: [PATCH 4/5] Complete the sketch

---
 designs/0035-cmdstanrs.md | 143 ++++++++++++++++++++++++++++++++++----
 1 file changed, 129 insertions(+), 14 deletions(-)
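
A minimal sketch of the `FromStr` idea raised under "Argument tree
considerations" in this patch. It hand-parses a tiny subset of the grammar
(three method names and `data file=...`) purely for illustration; the types
`Method`, `ArgumentTree` and `ParseError` are hypothetical, and the patch
itself proposes pest rather than a hand-written parser.

```rust
use std::str::FromStr;

/// Hypothetical subset of CmdStan's inference methods.
#[derive(Debug, PartialEq)]
enum Method {
    Sample,
    Optimize,
    Variational,
}

#[derive(Debug, PartialEq)]
struct ArgumentTree {
    method: Method,
    data_file: Option<String>,
}

#[derive(Debug, PartialEq)]
enum ParseError {
    MissingMethod,
    UnknownMethod(String),
    UnexpectedToken(String),
}

impl FromStr for ArgumentTree {
    type Err = ParseError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut method = None;
        let mut data_file = None;
        // Tracks whether the bare `data` keyword has been seen, so that
        // `file=...` is only accepted in that context.
        let mut in_data = false;
        for token in s.split_whitespace() {
            match token.split_once('=') {
                Some(("method", name)) => {
                    in_data = false;
                    method = Some(match name {
                        "sample" => Method::Sample,
                        "optimize" => Method::Optimize,
                        "variational" => Method::Variational,
                        other => return Err(ParseError::UnknownMethod(other.to_string())),
                    });
                }
                Some(("file", path)) if in_data => data_file = Some(path.to_string()),
                None if token == "data" => in_data = true,
                _ => return Err(ParseError::UnexpectedToken(token.to_string())),
            }
        }
        Ok(ArgumentTree {
            method: method.ok_or(ParseError::MissingMethod)?,
            data_file,
        })
    }
}
```

With this in place, `"method=sample data file=bernoulli.data.json".parse()`
produces a typed tree, and anything outside the subset is reported as a
`ParseError` instead of being passed through to the executable.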

diff --git a/designs/0035-cmdstanrs.md b/designs/0035-cmdstanrs.md
index 4e019b8..03a47f0 100644
--- a/designs/0035-cmdstanrs.md
+++ b/designs/0035-cmdstanrs.md
@@ -17,6 +17,8 @@ The goal is to provide an interface which enables users to:
   making these available for programmatic use)
 - call `diagnose` and `stansummary` tools and collect output
 
+The objective is to keep the interface as simple as possible.
+
 # Motivation
 [motivation]: #motivation
 
@@ -111,7 +113,7 @@ code simply won't compile. The resultant tree is an abstraction which
 enables the use of a single type (`CmdStanOutput`) to encapsulate a
 call to an executable.
 
-Unsurprisingly, the argument tree is a syntax tree for `CmdStan`
+Unsurprisingly, the argument tree is a syntax tree for CmdStan
 command arguments. We translate to the very simple command line
 language, but leave open the possibility of translation to other
 languages.
@@ -137,21 +139,30 @@ digit   -> "0" | ... | "9"
 
 value   -> number | path
 ```
-Where the productions for `number` and `path` are left out for brevity.
-The start symbol is `tree`. Generate the command line statement by
-folding the tree from left to right by generating the appropriate term
-from each node. 
+The productions for `number` and `path` are left out for brevity, and
+the start symbol is `tree`. Generate the command line statement by
+folding the tree from left to right, emitting the appropriate term
+for each node and building up a linear argument list.
+I sketched
+[this](https://github.com/andrewjradcliffe/cmdstan-translator/blob/main/translate.scm)
+out in Scheme, though why I am not sure.
 
 ### Ergonomics
 
+Philosophy:
+- pay (in LOC) for only what you need.
+- minimize differences between naming of the types and fields (see
+  below) in the Rust implementation and CmdStan.
+
 The builder pattern will be implemented for each `struct`, and for
-each `enum` variant (excluding unit variants). This
-enables the user to supply only the arguments for which they desire
-non-default values. Philosophy: pay (in LOC) for only what you need.
+each `enum` variant (excluding unit variants). This enables the user
+to supply only the arguments for which they desire non-default
+values. This leads to succinct code when one needs only the defaults
+([example](https://github.com/andrewjradcliffe/cmdstan-rs/blob/main/examples/bernoulli-many/main.rs)).
+
+#### A side effect of strong typing
 
-There is an incidental benefit (from my perspective) afforded by the
-strongly-typed representation: with
-[company-mode](https://github.com/company-mode/company-mode) and
+With [company-mode](https://github.com/company-mode/company-mode) and
 [eglot-mode](https://github.com/joaotavora/eglot)
 ([lsp-mode](https://github.com/emacs-lsp/lsp-mode/) also works) in
 Emacs 28.2, one can view options at each node in the argument tree by
@@ -171,14 +182,15 @@ The objective is for the interface to cover all options which can be
 passed to a compiled Stan program, that is, all methods and all
 options for said methods.
 
-
 ## Separation of concerns
 
 Other than the argument tree support, the interface proposed is very
 simple: the user can compile a Stan program, call it with arguments,
 get basic information from the output, and call
-`diagnose`/`stansummary`. Below, I provide the rationale for exclusion
-of two aspects.
+`diagnose`/`stansummary`.
+
+Below, I provide the rationale for exclusion of two aspects. My
+judgment is that they are useful, but are best developed separately.
 
 ### Serialization
 
@@ -231,3 +243,106 @@ ever.
 The current proposal is for `CmdStanOutput` to be capable of returning
 paths to the files and the user parses them however they desire.
 This leaves open the possibility of multiple parsing strategies.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+Representing CmdStan arguments/options as a concrete syntax tree is
+potentially brittle. If the CmdStan grammar undergoes radical change,
+this interface will need to change accordingly. However, the CmdStan
+grammar is intended to be quite stable. Moreover, it is not
+necessarily the case that radical changes to the CmdStan grammar could
+be hidden behind some other abstraction.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+Other than the direct representation of the CmdStan syntax tree, the
+proposal contains nothing new. Utilizing a concrete representation of
+the syntax tree does have benefits:
+- all outputs are handled via a single type, `CmdStanOutput`
+- elimination of individual structures and methods for each of Stan's
+  inference algorithms
+
+The question remains: is this a good idea?
+
+## Argument tree considerations
+
+As described above, the grammar for CmdStan arguments passed at the
+command line can be represented as a syntax tree through the use of
+sum and product types.
+- This enables compile-time validation of argument consistency -- the
+  worst that can happen is you provide a value that CmdStan does not
+  like (e.g. `num_threads=-20`).
+- At minimum, this will move a variety of run-time errors to compile
+  time; it might even help users to understand the methods and options
+  CmdStan provides.
+- This enables re-use of a parameterized argument tree -- one could
+  replace the inference method while leaving the other options
+  (e.g. data files) constant. As shown in [this
+  example](https://github.com/andrewjradcliffe/cmdstan-rs/blob/main/examples/bernoulli-many/main.rs),
+  such an approach can be quite expressive.
+
+Furthermore, representation as a concrete syntax tree enables the
+possibility of interesting features. One could parse the syntax tree
+from:
+- a string written to a log file
+- a string which is consistent with the grammar that CmdStan accepts
+
+The latter is interesting in that a user's extant command line input
+is all that is required to use the Rust interface. For example,
+this leads to the following syntax:
+
+```rust
+// Assuming we implemented this through the `FromStr` trait
+let tree: ArgumentTree = "method=sample data file=bernoulli.data.json".parse().unwrap();
+```
+
+This would substantially lower the barrier to adoption of the Rust
+interface as the user need only know what they are already doing.
+
+Due to Rust's orphan rules, such features would need to be implemented
+within this crate; they could be placed behind a feature gate to
+minimize compile time. It stands to reason that if we can translate to
+a string, we should be able to perform the inverse operation.
+
+The design philosophy here would be: a valid parse is whatever CmdStan
+is willing to accept. However, CmdStan accepts some weird statements.
+For example:
+```bash
+./bernoulli method=sample adapt engaged engaged=0 engaged engaged=1 gamma engaged gamma \
+    data filebernoulli.data.json
+```
+
+The proposal is to use [pest](https://github.com/pest-parser/pest),
+rather than write a custom parser.
+
+# Prior art
+[prior-art]: #prior-art
+
+I have used both CmdStanPy and the StanJulia suite of
+packages.  Years ago, I found them convenient.
+
+## Flat structure
+
+Both CmdStanPy and StanJulia pursue a flat structure. This works
+largely due to the provision of optional positional/keyword arguments
+in a dynamic language.
+
+This is not possible in Rust -- default values require the builder
+pattern in order to be ergonomic.
+
+## Naming
+
+The difference between naming of arguments/options in CmdStan and
+(CmdStanPy | StanJulia) can be a source of confusion. I suppose that
+one would not have this problem if one never used CmdStan.
+
+## Serialization/deserialization of inputs/outputs
+
+Undoubtedly, both CmdStanPy and the Julia suite are targeted at the
+dynamic language audience, which expects features such as
+serialization/deserialization to be built in.  In general, I would
+expect that Rust programmers would probably want to select their own
+I/O options, thus, I do not see it as a downside to exclude such
+features.

From 3f6dc99117572dc624a4dbb6ce3d9074691f8ce5 Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe <andrewjradcliffe@gmail.com>
Date: Wed, 22 Nov 2023 17:34:24 -0800
Subject: [PATCH 5/5] Fix bash example

---
 designs/0035-cmdstanrs.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/designs/0035-cmdstanrs.md b/designs/0035-cmdstanrs.md
index 03a47f0..f94d79f 100644
--- a/designs/0035-cmdstanrs.md
+++ b/designs/0035-cmdstanrs.md
@@ -311,7 +311,7 @@ is willing to accept. However, CmdStan accepts some weird statements.
 For example:
 ```bash
 ./bernoulli method=sample adapt engaged engaged=0 engaged engaged=1 gamma engaged gamma \
-    data filebernoulli.data.json
+    data file=bernoulli.data.json
 ```
 
 The proposal is to use [pest](https://github.com/pest-parser/pest),