Rust interface for CmdStan #51

andrewjradcliffe · 2023-11-23T01:57:03Z

N.B. no direct interface with C++!

This is a proposal for a CmdStan interface for the Rust programming language, analogous to the CmdStanPy et al. interfaces. There is one distinction, which some may consider a defect (critical input actively solicited): a strongly-typed representation of the CmdStan syntax tree.

link to rendered proposal: https://github.com/andrewjradcliffe/design-docs/blob/cmdstan-rs/designs/0035-cmdstanrs.md

Edit: add link

WardBrian

Thank you for putting this together. I left some thoughts/comments on individual points below, but I think the overall structure laid out here is a good direction for a rust interface to CmdStan. In particular, I at least am sold on the approach to the argument structure

WardBrian · 2023-11-27T17:19:40Z

designs/0035-cmdstanrs.md

+Suppose that you write Rust code. Suppose that you use `Stan` for
+probabilistic programming. You have three choices for creating an
+application which utilizes both: shell scripts, introduce a scripting
+language (e.g. Python) dependency, or a Rust interface for `CmdStan`.


There are a few more exotic options as well:

A (hypothetical) Rust interface to the stan/services APIs

Using Rust implementations of algorithms like NUTS and bridgestan, as is done in e.g. nutpie

These may be worth mentioning in the Rationale and Alternatives, as both have their own downsides as well.

WardBrian · 2023-11-27T17:24:20Z

designs/0035-cmdstanrs.md

+A `CmdStanModel` type will serve as an abstraction for a Stan program,
+which may need to be compiled. Rather than compile on construction, a
+compilation method must be explicitly called in user code (assuming
+that a satisfactory executable does not exist yet).


This model is something CmdStanPy is moving away from, because a CmdStanModel is pretty useless without an executable.

In a strongly typed context, one could imagine that this type could instead be called something like StanProgram, and the result of calling compile is a CmdStanModel which is always associated with a executable on disk - this is sometimes called the "typestate" pattern

Excellent idea. As is often the case, putting a name to a common idiom (the typestate pattern) is illuminating.

WardBrian · 2023-11-27T17:28:07Z

designs/0035-cmdstanrs.md

+appropriate term from each node, building up a linear argument list.
+I sketched
+[this](https://github.com/andrewjradcliffe/cmdstan-translator/blob/main/translate.scm)
+out in Scheme, why I am not sure.


I'm not sure this section is necessary, providing an example of use and/or a list of expected structs may be more useful

I agree -- this is at best a distraction.

WardBrian · 2023-11-27T17:31:27Z

designs/0035-cmdstanrs.md

+[num-complex](https://github.com/rust-num/num-complex) are reasonable
+choices, but nonetheless represent decisions to be made!
+
+From a design perspective, this is a great place to defer to the user,


For ease of implementation this is true, but the list of above subtleties (to me) make a loud argument in the opposite direction: this is something hard, and the obvious approach will easily lead to incorrect results, so it should be handled by the library (eventually?)

Alas, yes, it should be handled by the library. It could be gated by a feature, so as to delay the need for an implementation for the initial version. As additional dependencies would likely be introduced, a feature gate is probably appropriate anyway.

WardBrian · 2023-11-27T17:33:58Z

designs/0035-cmdstanrs.md

+Procedural macros, applied to a Stan program stored in a string
+literal in a Rust program, could be used to generate types and a
+parser for Stan CSVs produced said program. However, in order to


There is a middle-ground between this (very cool, but tricky) idea and doing nothing, which is providing some primitives (which may themselves be macros) that allow the user to provide the information required about types and names of variables. Worth considering, and it seems like any future automation could build on it.

The gate here is: decide on conventions for representation of Stan types as Rust types. nalgebra is quite similar to Eigen, including memory layout, so matrices are covered.

For arrays of dimension greater than 2, there is a problem.

Yes, there is ndarray, but its future is uncertain.

Nested Vec<Vec<Vec ... >>> naturally works, but I doubt users would like it (no syntactic sugar for indexing).

If we take the latter option (the only sensible course for stable development), then it would be relatively straightforward to use a procedural macro to derive a parser for a user-provided struct where the field names and respective types must match the Stan CSV output. I do something similar for my use cases.

WardBrian · 2023-11-27T17:40:48Z

designs/0035-cmdstanrs.md

+- At minimum, this will move a variety of run-time errors to compile
+  time; it might even help users to understand the methods and options
+  CmdStan provides.
+- This enables re-use of a parameterized argument tree -- one could


[comment] This is a very nice plus, IMO.

WardBrian · 2023-11-27T17:42:12Z

designs/0035-cmdstanrs.md

+- a string written to a log file
+- a string which is consistent with the grammar that CmdStan accepts


Or, one hopes, the JSON file cmdstan itself can now produce

I would also like to support reading from JSON; that should be relatively easy.

I got interested in PEGs, went a little overboard, and now there's a parser which is consistent with CmdStan; it supports a variety of input formats.

WardBrian · 2023-11-27T17:45:28Z

designs/0035-cmdstanrs.md

+    data file=bernoulli.data.json
+```
+
+The proposal is to use [pest](https://github.com/pest-parser/pest),


One thing I am curious about is total dependencies (it may be worth calling this out as its own list/table somewhere).

Both Python and R's wrappers at least strive to be minimal, within what their standard libraries provide. I think the wonder of cargo making dependencies on the whole less painful has a downside of them tending to proliferate.

Could items like this be hidden behind a cargo feature? Continuing the only-pay-for-what-you-want mentality from earlier

Things like MSRV may also be worth mentioning!

If not for this function, the MSRV would be quite a bit older.

Assuming that we put the parser behind a feature gate, the only remaining dependency is thiserror, which could be eliminated by writing out all the fmt::Display implementations (probably a couple hundred LOC).

I don't have any objections to thiserror - it seems quite common in practice and rather small in terms of scope.

You bring up an interesting other point: I (sort of implicitly) expected that this wrapper could be written using entirely safe Rust. Is that unsafe function really needed, or is it called as an optimization?

What other usages of unsafe would you anticipate, and why?

The single unsafe function arose from the need to split the string on the occurrence of '.', with the objective of replicating CmdStan's interesting treatment of paths. This could easily be eliminated if we weren't dealing with Os-specific details (hidden behind WTF-8 in OsString).

CmdStan basically just splits at the first occurrence of '.', starting from the right.
This becomes a problem when the only thing to the left of the first '.' is the empty string and there the input matches any of the forms below.

./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file='..' # produces as output file names: `._2.`, `._3.`, `._4.` ./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file=.foo # produces as output file names: `_2.foo`, `_3.foo`, `_4.foo` # this is actually a repeat of the above, with "foo" as the empty string ./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file='.' # produces as output file names: `_2.`, `_3.`, `_4.`

It is not possible to easily treat these by two edge cases due to the definition of an extension.
I might be able to come up with another way to do it, but it will complicate the code accordingly.

I see no issue with unsafe which is limited to a very specific use case that is fundamentally sound. This use is particularly easy to verify soundness: the &[u8] slices are created by as_encoded_bytes and promptly consumed; when the byte slices are re-joined, it is impossible for either to have become invalid UTF-8.

It is worth noting that OsStr::from_encoded_bytes_unchecked only appears in two places; both appearances involve inputs which are created within the same function body, that is, no foreign inputs, no intermediates escape the scope in which they are created and said intermediates are consumed in the same function body. Incidentally, it is more efficient to play with byte slices, but that was not my objective.

Each use looks like:

let bytes = path.as_encoded_bytes(); let mut iter = bytes.rsplitn(2, |b| *b == b'.'); let (prefix, suffix) = match (iter.next(), iter.next()) { (Some(suffix), Some(prefix)) => { // SAFETY: // - each fragment only contains content that originated // from `OsStr::as_encoded_bytes`. // - split with ASCII period, which is a non-empty UTF-8 // substring unsafe { ( OsStr::from_encoded_bytes_unchecked(prefix), OsStr::from_encoded_bytes_unchecked(suffix), ) } } _ => (path, "csv".as_ref()), };

I general, my philosophy on use of unsafe is that it is best to avoid where possible; for a crate like this, I certainly wouldn't expose any unsafe public APIs. There's no reason to expose an unsafe internal API for this crate, as there is nothing that warrants it here.

I realize that my response above could actually be read as somewhat alarming -- potentially implies lots of unsafe. Limited to two locations (1, 2) with no plans to expand beyond these, I hope that the above is reassuring.

This also sounds like something we will probably change (stan-dev/cmdstan#1213) - the current system is mostly designed-by-accident

Your use of unsafe seem fine to me. However, since essentially the same unsafe code is used in two places I'd encapsulate it in a single safe-to-call helper function.

This file extension handling corner case feels like a CmdStan bug and I wouldn't bother supporting it beyond a clear error message.

It is quite frustrating that as_/from_encoded_bytes does not provide a safe API for splitting OsStrs. Nevertheless, in this case unsafe can be avoided with a couple of allocations:

use std::ffi::{OsString, OsStr}; let (prefix, suffix) = match (path.file_stem(), path.extension()) { (Some(prefix), Some(suffix)) => (prefix, suffix.to_os_string()), (Some(name), None) if name.as_encoded_bytes().first() == Some(&b'.') => { let mut fake_name = std::path::PathBuf::from("x"); fake_name.as_mut_os_string().push(name); (OsStr::new(""), fake_name.extension().unwrap().to_os_string()) } (Some(name), None) => (name, OsString::from("csv")), (None, _) => { // No file name, which means the last path component is either "." or ".." if path.as_os_str().as_encoded_bytes().ends_with(b"..") { (OsStr::new("."), OsString::from("")) } else { (OsStr::new(""), OsString::from("")) } } };

Coming back to point (2), if the user specifies output file=out.d/file CmdStan tries to write to out_1.d/file etc but if the directory out_1.d does not exist CmdnStan won't create it and the output just silently disappears. That can't be right.

In order:

I completely agree. It should be moved to a single function.

I agree again. I found this when I was writing tests for the Rust code. However, if CmdStan does it, it seems fine to support it if it is easy enough. Splitting WTF-8 encoded bytes on ASCII is fundamentally sound, so the unsafe-ness doesn't bother me. In fact, std uses the same approach; I had not looked at std until now.

It is only the conversion back to OsStr that is unsafe, but I agree that a safe API would be nice. Unfortunately, literal split on the first period from the right is very hard to avoid. Your approach is clever, but is defeated by inputs such as below:

function bernoulli() { ./bernoulli data file=bernoulli.data.json sample num_chains=3 output file=$1 } mkdir -p foo/bar # files created by CmdStan bernoulli 'foo/.bar' # "foo/_1.bar", "foo/_2.bar", "foo/_3.bar" bernoulli 'foo/bar/' # "foo/bar/_1.csv", "foo/bar/_2.csv", "foo/bar/_3.csv" bernoulli 'foo/bar/.' # "foo/bar/_1.", "foo/bar/_2.", "foo/bar/_3." bernoulli 'foo/bar/..' # "foo/bar/._2.", "foo/bar/._3.", "foo/bar/._4." bernoulli 'foo/bar/...' # "foo/bar/.._2.", "foo/bar/.._3.", "foo/bar/.._4."

One might think that path::parent and path::file_name would reduce the problem to the code you gave, substituting the result of path::file_name, but that does not work due to path normalization. For example, foo/bar/. is equivalent to foo/bar when treated as a path; likewise for foo/bar/.

From a code complexity standpoint, splitting at the '.' byte is much simpler, as it does not require addition of more case logic, and matches what CmdStan does by default.

This is a good example of where unsafe is both easier to maintain and verify the correctness of. Using the std::path API quickly becomes cumbersome, as it performs the normalization of path components one would naturally expect; in essence, the logic of the program must now test for all the ways that a string with strange placements of '.' would be subject to path normalization, then avoid them.

This was sufficiently interesting that I wrote it as an issue for the CmdStan repository.

Summary

Until CmdStan switches from splitting on the last dot, there is no reason to deviate from said behavior in the Rust treatment of paths. Path normalization is not trivial (see cppreference for a nice recapitulation of the steps). Effort spent on attempting to reproduce the CmdStan behavior via a convoluted use of Path and OsStr, on the basis that no code should be unsafe, is likely to lead to logical errors-- I am not keen on working out all pathological (no pun intended) behavior induced by lack of path normalization and splitting on the last '.'. It is much simpler and maintainable to verify that the invariants of an unsafe function are satisfied than to create a Rube Goldberg machine.

Examples like Case 2 demonstrate that such a machine would not be feasible, as it is not sufficient to split parent and child, check the respective start/end, and be done. One must scan the entire string for dots and this is only possible after as_bytes_encoded, hence, we are back to unsafe to put the halves back together (i.e. right back to the original proposal).

Furthermore, the discrepancy between CmdStan's behavior and the proposed interface would be a a potential source of confusion. If someone wants to use wacky file names and the program (CmdStan) accepts them, we would be remiss to impose artificial limitations.

WardBrian · 2023-11-27T17:50:03Z

designs/0035-cmdstanrs.md

+The objective is for `CmdStanOutput` to be a self-contained record
+which includes all pertinent information from a successful executable
+call. This structure can then be used to direct calls of
+`diagnose`/`stansummary`. Naturally, methods on said type will be


Can you supply a complete list of signatures for methods on CmdStanOutput? You mention this path helper (very nice to provide, somewhat tricky to get right!) and stansummary, but what would their signatures look like, and are there more? I'm sure some of this will need to be discovered during implementation

Public signatures are (at time of writing)

/// Return the output files associated with the `CmdStan` call. pub fn output_files(&self) -> Vec<PathBuf>; /// Return the diagnostic files associated with the `CmdStan` call. pub fn diagnostic_files(&self) -> Vec<PathBuf>; /// Return the profile files associated with the `CmdStan` call. pub fn profile_files(&self) -> Vec<PathBuf>; /// Return an immutable reference to console output associated /// with the `CmdStan` call. pub fn output(&self) -> &process::Output; /// Return a reference to the current working directory at the /// time of the call. pub fn cwd_at_call(&self) -> &Path; /// Return a reference to the argument tree with which the call /// was made. pub fn argument_tree(&self) -> &ArgumentTree; // EDIT: on second thought, no. The next 3 methods introduce an unnecessary // coupling of state. // pub fn cmdstan(&self) -> &Path // pub fn diagnose(&self) -> io::Result<process::Output>; // pub fn stansummary(&self, opts: Option<StanSummaryOptions>) -> io::Result<process::Output> /// Write the console output to a file. The path at which the /// file will be created is determined in the following manner: If /// `file` is `None`, `"log.txt"` is adjoined on to `cwd_at_call`. /// If `file` is `Some(path)` and `path` is a relative path, then /// `path` is adjoined on to `cwd_at_call`; otherwise, `path` is /// assumed to be an absolute path and is used without /// modification. Upon successful write, the path at which the /// file was created is returned. pub fn write_output<P: AsRef<Path>>(&self, file: Option<P>) -> io::Result<PathBuf>

I don't know if I would keep write_output.

The path helper is actually pretty easy to figure out if you have the argument tree and cache the path of the working directory at the point of call. If the user had any relative paths, then said paths must have been relative to the directory at which they made the call (else the call fails and we don't make it this far); if they used absolute paths, then everything is sensible.

edit: on reflection, I should not blithely underestimate the ability of users to create valid-but-spaghetti paths.

WardBrian · 2023-11-27T17:52:16Z

designs/0035-cmdstanrs.md

+
+Methods (receiver: `CmdStanModel` instance) exposed to the user will
+include:
+- `validate_cmdstan` : determine whether the CmdStan installation works


Both this and passing the path to the CmdStan installation feel like they're concerns separate from a CmdStanModel instance to some extent.

Does it make sense to have a CmdStanInstallation struct which stores a path and validation logic? I'm not super familiar with what is idiomatic, but could there even be a lazy_once global which is used by default, and an overridden constructor if the user needs to supply their own or two separate instances in the same program?

Agreed. The path to the cmdstan installation going through CmdStanModel is definitely an undesirable thing. Separating into something like

#[derive(Clone)] pub struct CmdStanInstallation { path: PathBuf } pub struct StanProgram { path: PathBuf } impl CmdStanInstallation { fn validate(&self) -> io::Result<bool>; // you can break the install by deleting it pub fn compile(&self, prog: &StanProgram) -> Result<CmdStanModel, CompilationError>; pub fn try_from_env() -> io::Result<Self>; // we look for `CMDSTAN` in environment } impl TryFrom<&Path> for CmdStanInstallation { /* validate CmdStan works, else error */ }

Using an environment variable is a reasonably consistent way to set a global without committing to a lazily-initialized global in Rust. Admittedly, this just moves the problem elsewhere, but it is clear what you would be getting -- the current CmdStan, as determined by a specific environment variable.

I am not fond of user-facing globals (unless they truly are just static data, e.g. color schemes). If the global depends on something a user is to provide, it gets messy. If we provide an interface that requires a validated CmdStan installation, then it's up to the user to figure out how to meet said requirement. I don't see much harm placing the complexity on the user -- after all, it becomes our problem if it is ambiguous as to which CmdStan should be used at any given point.

If a user really wanted to do it via global, the code below is an option.

static CMDSTAN: once_cell::sync::Lazy<PathBuf> = once_cell::sync::Lazy::new(|| infallible_go_get_cmdstan()); fn my_cmdstan_install() -> CmdStanInstallation { let path: &Path = &*CMDSTAN; CmdStanInstallation::try_from(path).unwrap() }

SteveBronder · 2023-11-27T18:10:43Z

Kind of adjacent to this proposal, for a compiled language would something like a foreign function interface scheme be more appropriate? Using something like ffistan the rust interface would not need to compile a separate binary

WardBrian · 2023-11-27T19:20:44Z

Yeah, wrapping ffistan in rust is on my to-do list after I finish fleshing out the julia/python/R... but I can also imagine situations where someone would prefer a cmdstan wrapper even if they had an in-memory approach available. The simplest such example is one where you need to store a large number of draws for a large number of parameters and the total size is too big to store in memory

SteveBronder · 2023-11-29T20:42:42Z

@WardBrian Does ffistan not use Stan's writer class? Then folks could just use the writer to write to disk. If it only takes in a pointer to an array to write to I think you could still use something like mmap on linux to write to disk

WardBrian · 2023-11-29T21:15:34Z

@SteveBronder You could do it with mmap

andrewjradcliffe · 2023-12-04T02:51:21Z

FFI would be nice, but, from inspection of ffistan there is still a call to make in order to build the .so. The resultant calls are through through the C ABI, which avoids the rigmarole of using a separate binary, spawning a process, communication, etc.

The proposed argument tree abstraction would provide a convenient way to direct high-level calls, analogous to the sample, optimize and pathfinder methods (Python) / functions (Julia) I see for the clients.

WardBrian · 2023-12-04T17:52:52Z

I think there could definitely be code-reuse between something in Rust that shells out to cmdstan and something that does it through CFFI. From a UX standpoint you still have the same issue with having a lot of arguments and no good way of doing defaults/named arguments except through a builder-like pattern.

andrewjradcliffe added 5 commits November 21, 2023 18:50

Draft, prior to addition of Prior Art section

c323f1a

Wishful numbering

5d6456f

Some obvious touch-ups after viewing the rendered markdown

1cbf527

Complete the sketch

2b5a086

Fix bash example

3f6dc99

WardBrian reviewed Nov 27, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rust interface for CmdStan #51

Rust interface for CmdStan #51

andrewjradcliffe commented Nov 23, 2023 •

edited

Loading

WardBrian left a comment

WardBrian Nov 27, 2023

WardBrian Nov 27, 2023

andrewjradcliffe Dec 3, 2023

WardBrian Nov 27, 2023

andrewjradcliffe Dec 3, 2023

WardBrian Nov 27, 2023

andrewjradcliffe Dec 3, 2023

WardBrian Nov 27, 2023

andrewjradcliffe Dec 4, 2023

WardBrian Nov 27, 2023

WardBrian Nov 27, 2023

andrewjradcliffe Dec 4, 2023

WardBrian Nov 27, 2023

WardBrian Nov 27, 2023

andrewjradcliffe Dec 4, 2023

WardBrian Dec 4, 2023

andrewjradcliffe Dec 5, 2023

WardBrian Dec 5, 2023

nhuurre Dec 5, 2023

andrewjradcliffe Dec 7, 2023

WardBrian Nov 27, 2023

andrewjradcliffe Dec 4, 2023 •

edited

Loading

WardBrian Nov 27, 2023

andrewjradcliffe Dec 4, 2023

SteveBronder commented Nov 27, 2023

WardBrian commented Nov 27, 2023

SteveBronder commented Nov 29, 2023

WardBrian commented Nov 29, 2023

andrewjradcliffe commented Dec 4, 2023

WardBrian commented Dec 4, 2023

		- a string written to a log file
		- a string which is consistent with the grammar that CmdStan accepts

Rust interface for CmdStan #51

Are you sure you want to change the base?

Rust interface for CmdStan #51

Conversation

andrewjradcliffe commented Nov 23, 2023 • edited Loading

WardBrian left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Summary

Choose a reason for hiding this comment

andrewjradcliffe Dec 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SteveBronder commented Nov 27, 2023

WardBrian commented Nov 27, 2023

SteveBronder commented Nov 29, 2023

WardBrian commented Nov 29, 2023

andrewjradcliffe commented Dec 4, 2023

WardBrian commented Dec 4, 2023

andrewjradcliffe commented Nov 23, 2023 •

edited

Loading

andrewjradcliffe Dec 4, 2023 •

edited

Loading