From c323f1a83fe14373aadd189a9a92297686e076be Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe
Date: Tue, 21 Nov 2023 18:50:01 -0800
Subject: [PATCH 1/5] Draft, prior to addition of Prior Art section

---
 designs/0000-cmdstanrs.md | 233 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 233 insertions(+)
 create mode 100644 designs/0000-cmdstanrs.md

diff --git a/designs/0000-cmdstanrs.md b/designs/0000-cmdstanrs.md
new file mode 100644
index 0000000..1805bc7
--- /dev/null
+++ b/designs/0000-cmdstanrs.md
@@ -0,0 +1,233 @@
+- Feature Name: (fill me in with a unique ident, my_awesome_feature)
+- Start Date: (fill me in with today's date, YYYY-MM-DD)
+- RFC PR: (leave this empty)
+- Stan Issue: (leave this empty)
+
+# Summary
+[summary]: #summary
+
+This is a proposal for a Rust interface for CmdStan through compiled
+executables, that is, no direct interface with C++.
+
+The goal is to provide an interface which enables users to:
+- compile Stan programs (with arbitrary options)
+- build and compose arguments/options (to be passed to C++
+  executables) in idiomatic Rust
+- call C++ executables, then memoize input and collect output (thereby
+  making these available for programmatic use)
+- call `diagnose` and `stansummary` tools and collect output
+
+# Motivation
+[motivation]: #motivation
+
+Suppose that you write Rust code. Suppose that you use `Stan` for
+probabilistic programming. You have three choices for creating an
+application which utilizes both: shell scripts, a scripting
+language (e.g. Python) dependency, or a Rust interface for `CmdStan`.
+
+Orchestration using the shell suffers from portability issues and
+often leads to unnecessary fragmentation of control flow. Introducing a
+scripting language may confer portability, but control flow and error
+handling are now divided between two languages; furthermore, this
+necessitates code be written to serialize/deserialize intermediates.
+
+A Rust interface, in similar spirit to the CmdStan interfaces from
+other languages, would provide the necessary abstraction to eliminate
+the aforementioned problems.
+
+# Functional Specification
+
+Given a Stan program, a user of the library will compile the model (if
+desired), call the executable with arguments (translated from
+strongly-typed argument tree), and obtain a self-contained context
+which encapsulates the pertinent information from the call.
+
+## Assumptions
+
+We assume (at our peril) that Rust programmers will be able to
+figure out how to satisfy the following requirement:
+- a working CmdStan installation exists at some user-accessible path
+
+## Control: compilation and calling the resultant executable
+
+A `CmdStanModel` type will serve as an abstraction for a Stan program,
+which may need to be compiled. Rather than compile on construction, a
+compilation method must be explicitly called in user code (assuming
+that a satisfactory executable does not exist yet).
+
+Two arguments will be necessary to create a `CmdStanModel`:
+1. a path to a CmdStan installation
+2. a path to a Stan program
+
+Methods (receiver: `CmdStanModel` instance) exposed to the user will
+include:
+- `validate_cmdstan` : determine whether the CmdStan installation works
+- `executable_works` : is there a working executable at the path
+  implied by the Stan file?
+- `compile_with_args` : attempt to compile a Stan program with
+  optional `make` arguments
+- `call_executable` : call executable with the given argument tree; on
+  success, return a `CmdStanOutput` instance.
+
+## Output
+
+A successful `call_executable` call on a `CmdStanModel` will produce a
+`CmdStanOutput` instance, which encapsulates the context of
+the call. This includes:
+- the console output (exit status, stdout and stderr),
+- the argument tree provided to `call_executable`
+- the current working directory of the process at the time the call was made
+- the CmdStan installation from the parent `CmdStanModel`
+
+The objective is for `CmdStanOutput` to be a self-contained record
+which includes all pertinent information from a successful executable
+call. This structure can then be used to direct calls of
+`diagnose`/`stansummary`. Naturally, methods on said type will be
+present to expose the context to the user and perform utility
+functions (e.g. return a list of output file paths).
+
+## Processes, IO
+
+The proposal is to use the Rust `std` library, in particular the
+[process](https://doc.rust-lang.org/std/process/index.html),
+[path](https://doc.rust-lang.org/std/path/index.html),
+[fs](https://doc.rust-lang.org/std/fs/index.html), and
+[ffi](https://doc.rust-lang.org/std/ffi/index.html) modules, to
+orchestrate processes, interact with the file system and handle
+cross-platform concerns. This will yield a library which is portable,
+provided that it is (cross-)compiled for the intended target.
+
+## Arguments and options
+
+Stan provides several inference engines, each with a large number of
+options. CmdStan in turn handles this heterogeneity.
+
+To encapsulate the arguments passed at the command line (to a compiled
+executable), the proposal is a strongly-typed representation of this
+heterogeneity using a combination of sum types (Rust `enum`) and
+product types (Rust `struct`). By construction, this representation
+prevents the formation of inconsistent argument combinations -- the
+code simply won't compile. The resultant tree is an abstraction which
+enables the use of a single type (`CmdStanOutput`) to encapsulate a
+call to an executable.
+
+Unsurprisingly, the argument tree is a syntax tree for `CmdStan`
+command arguments. We translate to the very simple command line
+language, but leave open the possibility of translation to other
+languages.
+
+### Translation
+
+The (sloppy) productions for the command line language are:
+```text
+tree -> terms
+terms -> term " " term | term
+term -> pair | product | sum
+pair -> key "=" value
+product -> type " " pairs
+sum -> type "=" variant " " terms | type "=" variant
+pairs -> pairs " " pair | pair
+
+key -> A
+A -> A alpha | beta
+alpha -> letter | digit | "_"
+beta -> letter
+letter -> "A" | ... | "z"
+digit -> "0" | ... | "9"
+
+value -> number | path
+```
+Where the productions for `number` and `path` are left out for brevity.
+The start symbol is `tree`. Generate the command line statement by
+folding the tree from left to right by generating the appropriate term
+from each node.
+
+### Ergonomics
+
+The builder pattern will be implemented for each `struct`, and for
+each `enum` variant (excluding unit variants). This
+enables the user to supply only the arguments for which they desire
+non-default values. Philosophy: pay (in LOC) for only what you need.
+
+There is an incidental benefit (from my perspective) afforded by the
+strongly-typed representation: with
+[company-mode](https://github.com/company-mode/company-mode) and
+[eglot-mode](https://github.com/joaotavora/eglot)
+([lsp-mode](https://github.com/emacs-lsp/lsp-mode/) also works) in
+Emacs 28.2, one can view options at each node in the argument tree by
+code that looks something like the following:
+
+```rust
+ArgumentTree::builder(). // hover on the `.`
+```
+
+If one has a Rust language server and completion support in their
+editor, this is a free side effect. Whether it will help anyone
+is uncertain.
+
+### Coverage
+
+The objective is for the interface to cover all options which can be
+passed to a compiled Stan program, that is, all methods and all
+options for said methods.
+
+
+## Separation of concerns
+
+Other than the argument tree support, the interface proposed is very
+simple: the user can compile a Stan program, call it with arguments,
+get basic information from the output, and call
+`diagnose`/`stansummary`. Below, I provide the rationale for exclusion
+of two aspects.
+
+### Serialization
+
+It is trivial to provide a method such as
+```rust
+fn `write_json(data: T, file: &Path) {}`
+```
+but it serves no purpose -- it does not enforce the conventions
+adopted for representing Stan data types in JSON (e.g. matrices
+represented as a vector of *row* vectors, not a vector of *column*
+vectors), hence, would likely lead to unexpected (and potentially
+silent!) errors.
+
+In order to develop a serializer which respects the conventions for
+Stan data types, one would need to declare conventions for the mapping
+of Rust data types to Stan data
+types. [serde_json](https://github.com/serde-rs/json) would be nice to
+use, but has some incompatibilities (Rust tuple is represented as an
+array, rather than an object).
+
+Moreover, to represent matrices, complex numbers, etc., one would need
+to support types from the crate ecosystem since the standard library
+lacks these -- [nalgebra](https://github.com/dimforge/nalgebra) and
+[num-complex](https://github.com/rust-num/num-complex) are reasonable
+choices, but nonetheless represent decisions to be made!
+
+From a design perspective, this is a great place to defer to the user,
+at least for the moment. A principled approach would involve writing a
+data format for [serde](https://serde.rs/)
+
+### Deserialization
+
+Parsing Stan CSVs to a strongly-typed representation is simple if one
+wishes to simply obtain a matrix of values (or `Vec<Vec<_>>` if we
+limit ourselves to `std` library types). However, one needs to extract
+the variables from each line, thus, one needs to know the types (and
+their dimensions). A recursive definition of types using `enum`s and
+`struct`s could probably work to represent such a thing in Rust, but
+may not necessarily be particularly ergonomic (i.e. much unavoidable
+boilerplate would be needed to use the resultant type).
+
+Procedural macros, applied to a Stan program stored in a string
+literal in a Rust program, could be used to generate types and a
+parser for Stan CSVs produced by said program. However, in order to
+implement such an idea, one would first need to adopt conventions for
+representing Stan data types using Rust data types. This requires
+careful thought and is something best left for the future, if
+ever.
+
+The current proposal is for `CmdStanOutput` to be capable of returning
+paths to the files and the user parses them however they desire.
+This leaves open the possibility of multiple parsing strategies.
From 5d6456fec220136578f9eb05be648d77d5cad488 Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe
Date: Tue, 21 Nov 2023 18:52:58 -0800
Subject: [PATCH 2/5] Wishful numbering

---
 designs/{0000-cmdstanrs.md => 0035-cmdstanrs.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename designs/{0000-cmdstanrs.md => 0035-cmdstanrs.md} (100%)

diff --git a/designs/0000-cmdstanrs.md b/designs/0035-cmdstanrs.md
similarity index 100%
rename from designs/0000-cmdstanrs.md
rename to designs/0035-cmdstanrs.md
From 1cbf5275cc86a83426cbf79962a97ff104bb1259 Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe
Date: Tue, 21 Nov 2023 19:02:56 -0800
Subject: [PATCH 3/5] Some obvious touch-ups after viewing the rendered markdown

---
 designs/0035-cmdstanrs.md | 40 +++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/designs/0035-cmdstanrs.md b/designs/0035-cmdstanrs.md
index 1805bc7..4e019b8 100644
--- a/designs/0035-cmdstanrs.md
+++ b/designs/0035-cmdstanrs.md
@@ -1,7 +1,7 @@
-- Feature Name: (fill me in with a unique ident, my_awesome_feature)
-- Start Date: (fill me in with today's date, YYYY-MM-DD)
-- RFC PR: (leave this empty)
-- Stan Issue: (leave this empty)
+- Feature Name: cmdstanrs
+- Start Date: 2023-11-22
+- RFC PR:
+- Stan Issue:
 
 # Summary
 [summary]: #summary
@@ -42,12 +42,23 @@ desired), call the executable with arguments (translated from
 strongly-typed argument tree), and obtain a self-contained context
 which encapsulates the pertinent information from the call.
 
-## Assumptions
+### Assumptions
 
 We assume (at our peril) that Rust programmers will be able to
 figure out how to satisfy the following requirement:
 - a working CmdStan installation exists at some user-accessible path
 
+### Processes, IO
+
+The proposal is to use the Rust `std` library, in particular the
+[process](https://doc.rust-lang.org/std/process/index.html),
+[path](https://doc.rust-lang.org/std/path/index.html),
+[fs](https://doc.rust-lang.org/std/fs/index.html), and
+[ffi](https://doc.rust-lang.org/std/ffi/index.html) modules, to
+orchestrate processes, interact with the file system and handle
+cross-platform concerns. This will yield a library which is portable,
+provided that it is (cross-)compiled for the intended target.
+
 ## Control: compilation and calling the resultant executable
 
 A `CmdStanModel` type will serve as an abstraction for a Stan program,
@@ -86,17 +97,6 @@ call. This structure can then be used to direct calls of
 present to expose the context to the user and perform utility
 functions (e.g. return a list of output file paths).
 
-## Processes, IO
-
-The proposal is to use the Rust `std` library, in particular the
-[process](https://doc.rust-lang.org/std/process/index.html),
-[path](https://doc.rust-lang.org/std/path/index.html),
-[fs](https://doc.rust-lang.org/std/fs/index.html), and
-[ffi](https://doc.rust-lang.org/std/ffi/index.html) modules, to
-orchestrate processes, interact with the file system and handle
-cross-platform concerns. This will yield a library which is portable,
-provided that it is (cross-)compiled for the intended target.
-
 ## Arguments and options
 
 Stan provides several inference engines, each with a large number of
@@ -121,7 +121,7 @@ languages.
 The (sloppy) productions for the command line language are:
 ```text
 tree -> terms
-terms -> term " " term | term
+terms -> terms " " term | term
 term -> pair | product | sum
 pair -> key "=" value
 product -> type " " pairs
@@ -182,9 +182,9 @@ of two aspects.
 
 ### Serialization
 
-It is trivial to provide a method such as
+It is trivial to provide a function such as
 ```rust
-fn `write_json(data: T, file: &Path) {}`
+fn write_json<T>(data: &T, file: &Path) {}
 ```
 but it serves no purpose -- it does not enforce the conventions
 adopted for representing Stan data types in JSON (e.g. matrices
@@ -207,7 +207,7 @@ choices, but nonetheless represent decisions to be made!
 
 From a design perspective, this is a great place to defer to the user,
 at least for the moment. A principled approach would involve writing a
-data format for [serde](https://serde.rs/)
+data format for [serde](https://serde.rs/).
 
 ### Deserialization
 
From 2b5a086714b44898e5a3dd6ca738a990af1a8632 Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe
Date: Wed, 22 Nov 2023 17:33:03 -0800
Subject: [PATCH 4/5] Complete the sketch

---
 designs/0035-cmdstanrs.md | 143 ++++++++++++++++++++++++++++++++++----
 1 file changed, 129 insertions(+), 14 deletions(-)

diff --git a/designs/0035-cmdstanrs.md b/designs/0035-cmdstanrs.md
index 4e019b8..03a47f0 100644
--- a/designs/0035-cmdstanrs.md
+++ b/designs/0035-cmdstanrs.md
@@ -17,6 +17,8 @@ The goal is to provide an interface which enables users to:
   making these available for programmatic use)
 - call `diagnose` and `stansummary` tools and collect output
 
+The objective is to keep the interface as simple as possible.
+
 # Motivation
 [motivation]: #motivation
 
@@ -111,7 +113,7 @@ code simply won't compile. The resultant tree is an abstraction which
 enables the use of a single type (`CmdStanOutput`) to encapsulate a
 call to an executable.
 
-Unsurprisingly, the argument tree is a syntax tree for `CmdStan`
+Unsurprisingly, the argument tree is a syntax tree for CmdStan
 command arguments. We translate to the very simple command line
 language, but leave open the possibility of translation to other
 languages.
@@ -137,21 +139,30 @@ digit -> "0" | ... | "9"
 
 value -> number | path
 ```
-Where the productions for `number` and `path` are left out for brevity.
-The start symbol is `tree`. Generate the command line statement by
-folding the tree from left to right by generating the appropriate term
-from each node.
+Where the productions for `number` and `path` are left out for
+brevity. The start symbol is `tree`. Generate the command line
+statement by folding the tree from left to right by generating the
+appropriate term from each node, building up a linear argument list.
+I sketched
+[this](https://github.com/andrewjradcliffe/cmdstan-translator/blob/main/translate.scm)
+out in Scheme, why I am not sure.
 
 ### Ergonomics
 
+Philosophy:
+- pay (in LOC) for only what you need.
+- minimize differences between naming of the types and fields (see
+  below) in the Rust implementation and CmdStan.
+
 The builder pattern will be implemented for each `struct`, and for
-each `enum` variant (excluding unit variants). This
-enables the user to supply only the arguments for which they desire
-non-default values. Philosophy: pay (in LOC) for only what you need.
+each `enum` variant (excluding unit variants). This enables the user
+to supply only the arguments for which they desire non-default
+values. This leads to succinct code when one needs only the defaults
+([example](https://github.com/andrewjradcliffe/cmdstan-rs/blob/main/examples/bernoulli-many/main.rs)).
+
+#### A side effect of strong typing
 
-There is an incidental benefit (from my perspective) afforded by the
-strongly-typed representation: with
-[company-mode](https://github.com/company-mode/company-mode) and
+With [company-mode](https://github.com/company-mode/company-mode) and
 [eglot-mode](https://github.com/joaotavora/eglot)
 ([lsp-mode](https://github.com/emacs-lsp/lsp-mode/) also works) in
 Emacs 28.2, one can view options at each node in the argument tree by
@@ -171,14 +182,15 @@ The objective is for the interface to cover all options which can be
 passed to a compiled Stan program, that is, all methods and all
 options for said methods.
 
-
 ## Separation of concerns
 
 Other than the argument tree support, the interface proposed is very
 simple: the user can compile a Stan program, call it with arguments,
 get basic information from the output, and call
-`diagnose`/`stansummary`. Below, I provide the rationale for exclusion
-of two aspects.
+`diagnose`/`stansummary`.
+
+Below, I provide the rationale for exclusion of two aspects. My
+judgment is that they are useful, but are best developed separately.
 
 ### Serialization
 
@@ -231,3 +243,106 @@ ever.
 The current proposal is for `CmdStanOutput` to be capable of returning
 paths to the files and the user parses them however they desire.
 This leaves open the possibility of multiple parsing strategies.
+
+# Drawbacks
+[drawbacks]: #drawbacks
+
+Representing CmdStan arguments/options as a concrete syntax tree is
+potentially brittle. If the CmdStan grammar undergoes radical change,
+this interface will need to change accordingly. However, the CmdStan
+grammar is intended to be quite stable. Moreover, it is not
+necessarily the case that radical changes to the CmdStan grammar could
+be hidden behind some other abstraction.
+
+# Rationale and alternatives
+[rationale-and-alternatives]: #rationale-and-alternatives
+
+Other than the direct representation of the CmdStan syntax tree, the
+proposal contains nothing new. Utilizing a concrete representation of
+the syntax tree does have benefits:
+- all outputs handled via a single type `CmdStanOutput`
+- elimination of individual structures and methods for each of Stan's
+  inference algorithms
+
+The question remains: is this a good idea?
+
+## Argument tree considerations
+
+As described above, the grammar for CmdStan arguments passed at the
+command line can be represented as a syntax tree through the use of
+sum and product types.
+- This enables compile-time validation of argument consistency -- the
+  worst that can happen is you provide a value that CmdStan does not
+  like (e.g. `num_threads=-20`).
+- At minimum, this will move a variety of run-time errors to compile
+  time; it might even help users to understand the methods and options
+  CmdStan provides.
+- This enables re-use of a parameterized argument tree -- one could
+  replace the inference method while leaving the other options
+  (e.g. data files) constant. As shown in [this
+  example](https://github.com/andrewjradcliffe/cmdstan-rs/blob/main/examples/bernoulli-many/main.rs),
+  such an approach can be quite expressive.
+
+Furthermore, representation as a concrete syntax tree enables the
+possibility of interesting features. One could parse the syntax tree
+from:
+- a string written to a log file
+- a string which is consistent with the grammar that CmdStan accepts
+
+The latter is interesting in that a user's extant command line input
+is all that is required to use the Rust interface. For example,
+this leads to the following syntax:
+
+```rust
+// Assuming we implemented this through the `FromStr` trait
+let tree: ArgumentTree = "method=sample data file=bernoulli.data.json".parse().unwrap();
+```
+
+This would substantially lower the barrier to adoption of the Rust
+interface as the user need only know what they are already doing.
+
+Due to Rust's orphan rules, such features would need to be implemented
+within this crate; they could be placed behind a feature gate to
+minimize compile time. It stands to reason that if we can translate to
+a string, we should be able to perform the inverse operation.
+
+The design philosophy here would be: a valid parse is whatever CmdStan
+is willing to accept. However, CmdStan accepts some weird statements.
+For example:
+```bash
+./bernoulli method=sample adapt engaged engaged=0 engaged engaged=1 gamma engaged gamma \
+  data filebernoulli.data.json
+```
+
+The proposal is to use [pest](https://github.com/pest-parser/pest),
+rather than write a custom parser.
+
+# Prior art
+[prior-art]: #prior-art
+
+I have used both CmdStanPy and the StanJulia suite of
+packages. Years ago, I found them convenient.
+
+## Flat structure
+
+Both CmdStanPy and StanJulia pursue a flat structure. This works
+largely due to the provision of optional positional/keyword arguments
+in a dynamic language.
+
+This is not possible in Rust -- default values require the builder
+pattern in order to be ergonomic.
+
+## Naming
+
+The difference between naming of arguments/options in CmdStan and
+(CmdStanPy | StanJulia) can be a source of confusion. I suppose that
+one would not have this problem if one never used CmdStan.
+
+## Serialization/deserialization of inputs/outputs
+
+Undoubtedly, both CmdStanPy and the Julia suite are targeted at the
+dynamic language audience, which expects features such as
+serialization/deserialization to be built in. In general, I would
+expect that Rust programmers would probably want to select their own
+I/O options, thus, I do not see it as a downside to exclude such
+features.
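The sum/product representation and the left-to-right fold described under "Arguments and options" and "Translation" could look roughly like the following. This is only a minimal sketch under stated assumptions: the names `ArgumentTree`, `Method`, `Data`, `push_terms` and `to_args` are hypothetical and are not taken from the cmdstan-rs crate, and the tree is cut down to two methods and one data option for illustration.

```rust
use std::path::PathBuf;

/// Sum type: exactly one inference method is selected.
/// Hypothetical and heavily reduced; CmdStan exposes more methods and options.
#[derive(Debug, Clone, PartialEq)]
enum Method {
    Sample { num_samples: u32, num_warmup: u32 },
    Optimize { iter: u32 },
}

/// Product type: the `data` node and its `file` pair.
#[derive(Debug, Clone, PartialEq)]
struct Data {
    file: PathBuf,
}

/// Root of the (hypothetical) argument tree.
#[derive(Debug, Clone, PartialEq)]
struct ArgumentTree {
    method: Method,
    data: Data,
}

impl Method {
    /// Emit `type "=" variant` followed by the variant's pairs.
    fn push_terms(&self, out: &mut Vec<String>) {
        match self {
            Method::Sample { num_samples, num_warmup } => {
                out.push("method=sample".to_string());
                out.push(format!("num_samples={}", num_samples));
                out.push(format!("num_warmup={}", num_warmup));
            }
            Method::Optimize { iter } => {
                out.push("method=optimize".to_string());
                out.push(format!("iter={}", iter));
            }
        }
    }
}

impl Data {
    /// Emit `type` followed by its pairs.
    fn push_terms(&self, out: &mut Vec<String>) {
        out.push("data".to_string());
        out.push(format!("file={}", self.file.display()));
    }
}

impl ArgumentTree {
    /// Fold the tree from left to right into a linear argument list.
    fn to_args(&self) -> Vec<String> {
        let mut out = Vec::new();
        self.method.push_terms(&mut out);
        self.data.push_terms(&mut out);
        out
    }
}
```

Because `Method` is an `enum`, a tree cannot select two methods at once or attach `num_warmup` to `Optimize`, which is the compile-time consistency argument made in the design. Swapping the `method` field while keeping `data` fixed also gives the re-use mentioned under "Argument tree considerations".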
From 3f6dc99117572dc624a4dbb6ce3d9074691f8ce5 Mon Sep 17 00:00:00 2001
From: Andrew Radcliffe
Date: Wed, 22 Nov 2023 17:34:24 -0800
Subject: [PATCH 5/5] Fix bash example

---
 designs/0035-cmdstanrs.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/designs/0035-cmdstanrs.md b/designs/0035-cmdstanrs.md
index 03a47f0..f94d79f 100644
--- a/designs/0035-cmdstanrs.md
+++ b/designs/0035-cmdstanrs.md
@@ -311,7 +311,7 @@ is willing to accept. However, CmdStan accepts some weird statements.
 For example:
 ```bash
 ./bernoulli method=sample adapt engaged engaged=0 engaged engaged=1 gamma engaged gamma \
-  data filebernoulli.data.json
+  data file=bernoulli.data.json
 ```
 
 The proposal is to use [pest](https://github.com/pest-parser/pest),
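The one-character fix above is exactly the `pair -> key "=" value` production from the design's grammar. As a rough illustration of the inverse (string to tree) direction, a single pair could be parsed through `FromStr` along these lines. This is a sketch, not the proposed pest-based parser: `Pair` and `ParsePairError` are hypothetical names, `letter` is read as an ASCII letter, and the value is kept as an unvalidated string because the design leaves the `number` and `path` productions out.

```rust
use std::str::FromStr;

/// One `key "=" value` term (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct Pair {
    key: String,
    value: String,
}

/// Ways in which a candidate pair can fail to match the production.
#[derive(Debug, PartialEq)]
enum ParsePairError {
    MissingEquals,
    InvalidKey,
}

impl FromStr for Pair {
    type Err = ParsePairError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // pair -> key "=" value: split at the first '='.
        let (key, value) = s.split_once('=').ok_or(ParsePairError::MissingEquals)?;

        // key -> a letter followed by any number of letters, digits or '_'.
        let mut chars = key.chars();
        let starts_with_letter = matches!(chars.next(), Some(c) if c.is_ascii_alphabetic());
        let rest_ok = chars.all(|c| c.is_ascii_alphanumeric() || c == '_');
        if !(starts_with_letter && rest_ok) {
            return Err(ParsePairError::InvalidKey);
        }

        Ok(Pair {
            key: key.to_string(),
            value: value.to_string(),
        })
    }
}
```

Under this reading, `file=bernoulli.data.json` parses as a pair, while the pre-fix `filebernoulli.data.json` is rejected for lacking the `=` separator.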