-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rust interface for CmdStan #51
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for putting this together. I left some thoughts/comments on individual points below, but I think the overall structure laid out here is a good direction for a rust interface to CmdStan. In particular, I at least am sold on the approach to the argument structure
Suppose that you write Rust code. Suppose that you use `Stan` for | ||
probabilistic programming. You have three choices for creating an | ||
application which utilizes both: shell scripts, introduce a scripting | ||
language (e.g. Python) dependency, or a Rust interface for `CmdStan`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few more exotic options as well:
- A (hypothetical) Rust interface to the
stan/services
APIs - Using Rust implementations of algorithms like NUTS and
bridgestan
, as is done in e.g.nutpie
These may be worth mentioning in the Rationale and Alternatives, as both have their own downsides as well.
A `CmdStanModel` type will serve as an abstraction for a Stan program, | ||
which may need to be compiled. Rather than compile on construction, a | ||
compilation method must be explicitly called in user code (assuming | ||
that a satisfactory executable does not exist yet). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This model is something CmdStanPy is moving away from, because a CmdStanModel is pretty useless without an executable.
In a strongly typed context, one could imagine that this type could instead be called something like StanProgram
, and the result of calling compile
is a CmdStanModel
which is always associated with a executable on disk - this is sometimes called the "typestate" pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent idea. As is often the case, putting a name to a common idiom (the typestate pattern) is illuminating.
appropriate term from each node, building up a linear argument list. | ||
I sketched | ||
[this](https://github.com/andrewjradcliffe/cmdstan-translator/blob/main/translate.scm) | ||
out in Scheme, why I am not sure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this section is necessary, providing an example of use and/or a list of expected structs may be more useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- this is at best a distraction.
[num-complex](https://github.com/rust-num/num-complex) are reasonable | ||
choices, but nonetheless represent decisions to be made! | ||
|
||
From a design perspective, this is a great place to defer to the user, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For ease of implementation this is true, but the list of above subtleties (to me) make a loud argument in the opposite direction: this is something hard, and the obvious approach will easily lead to incorrect results, so it should be handled by the library (eventually?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alas, yes, it should be handled by the library. It could be gated by a feature, so as to delay the need for an implementation for the initial version. As additional dependencies would likely be introduced, a feature gate is probably appropriate anyway.
Procedural macros, applied to a Stan program stored in a string | ||
literal in a Rust program, could be used to generate types and a | ||
parser for Stan CSVs produced said program. However, in order to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a middle-ground between this (very cool, but tricky) idea and doing nothing, which is providing some primitives (which may themselves be macros) that allow the user to provide the information required about types and names of variables. Worth considering, and it seems like any future automation could build on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The gate here is: decide on conventions for representation of Stan types as Rust types. nalgebra
is quite similar to Eigen, including memory layout, so matrices are covered.
For arrays of dimension greater than 2, there is a problem.
- Yes, there is
ndarray
, but itsfuture is uncertain
. - Nested
Vec<Vec<Vec ... >>>
naturally works, but I doubt users would like it (no syntactic sugar for indexing).
If we take the latter option (the only sensible course for stable development), then it would be relatively straightforward to use a procedural macro to derive a parser for a user-provided struct where the field names and respective types must match the Stan CSV output. I do something similar for my use cases.
- At minimum, this will move a variety of run-time errors to compile | ||
time; it might even help users to understand the methods and options | ||
CmdStan provides. | ||
- This enables re-use of a parameterized argument tree -- one could |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[comment] This is a very nice plus, IMO.
- a string written to a log file | ||
- a string which is consistent with the grammar that CmdStan accepts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, one hopes, the JSON file cmdstan itself can now produce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also like to support reading from JSON; that should be relatively easy.
I got interested in PEGs, went a little overboard, and now there's a parser which is consistent with CmdStan; it supports a variety of input formats.
data file=bernoulli.data.json | ||
``` | ||
|
||
The proposal is to use [pest](https://github.com/pest-parser/pest), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I am curious about is total dependencies (it may be worth calling this out as its own list/table somewhere).
Both Python and R's wrappers at least strive to be minimal, within what their standard libraries provide. I think the wonder of cargo
making dependencies on the whole less painful has a downside of them tending to proliferate.
Could items like this be hidden behind a cargo feature? Continuing the only-pay-for-what-you-want mentality from earlier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things like MSRV may also be worth mentioning!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not for this function, the MSRV would be quite a bit older.
Assuming that we put the parser behind a feature gate, the only remaining dependency is thiserror
, which could be eliminated by writing out all the fmt::Display
implementations (probably a couple hundred LOC).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any objections to thiserror
- it seems quite common in practice and rather small in terms of scope.
You bring up an interesting other point: I (sort of implicitly) expected that this wrapper could be written using entirely safe Rust. Is that unsafe
function really needed, or is it called as an optimization?
What other usages of unsafe
would you anticipate, and why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The single unsafe function arose from the need to split the string on the occurrence of '.'
, with the objective of replicating CmdStan's interesting treatment of paths. This could easily be eliminated if we weren't dealing with Os-specific details (hidden behind WTF-8 in OsString).
CmdStan basically just splits at the first occurrence of '.'
, starting from the right.
This becomes a problem when the only thing to the left of the first '.'
is the empty string and there the input matches any of the forms below.
./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file='..'
# produces as output file names: `._2.`, `._3.`, `._4.`
./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file=.foo
# produces as output file names: `_2.foo`, `_3.foo`, `_4.foo`
# this is actually a repeat of the above, with "foo" as the empty string
./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file='.'
# produces as output file names: `_2.`, `_3.`, `_4.`
It is not possible to easily treat these by two edge cases due to the definition of an extension.
I might be able to come up with another way to do it, but it will complicate the code accordingly.
I see no issue with unsafe
which is limited to a very specific use case that is fundamentally sound. This use is particularly easy to verify soundness: the &[u8]
slices are created by as_encoded_bytes
and promptly consumed; when the byte slices are re-joined, it is impossible for either to have become invalid UTF-8.
It is worth noting that OsStr::from_encoded_bytes_unchecked
only appears in two places; both appearances involve inputs which are created within the same function body, that is, no foreign inputs, no intermediates escape the scope in which they are created and said intermediates are consumed in the same function body. Incidentally, it is more efficient to play with byte slices, but that was not my objective.
Each use looks like:
let bytes = path.as_encoded_bytes();
let mut iter = bytes.rsplitn(2, |b| *b == b'.');
let (prefix, suffix) = match (iter.next(), iter.next()) {
(Some(suffix), Some(prefix)) => {
// SAFETY:
// - each fragment only contains content that originated
// from `OsStr::as_encoded_bytes`.
// - split with ASCII period, which is a non-empty UTF-8
// substring
unsafe {
(
OsStr::from_encoded_bytes_unchecked(prefix),
OsStr::from_encoded_bytes_unchecked(suffix),
)
}
}
_ => (path, "csv".as_ref()),
};
I general, my philosophy on use of unsafe
is that it is best to avoid where possible; for a crate like this, I certainly wouldn't expose any unsafe
public APIs. There's no reason to expose an unsafe internal API for this crate, as there is nothing that warrants it here.
I realize that my response above could actually be read as somewhat alarming -- potentially implies lots of unsafe
. Limited to two locations (1, 2) with no plans to expand beyond these, I hope that the above is reassuring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also sounds like something we will probably change (stan-dev/cmdstan#1213) - the current system is mostly designed-by-accident
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Your use of
unsafe
seem fine to me. However, since essentially the same unsafe code is used in two places I'd encapsulate it in a single safe-to-call helper function. - This file extension handling corner case feels like a CmdStan bug and I wouldn't bother supporting it beyond a clear error message.
- It is quite frustrating that
as_/from_encoded_bytes
does not provide a safe API for splittingOsStr
s. Nevertheless, in this caseunsafe
can be avoided with a couple of allocations:
use std::ffi::{OsString, OsStr};
let (prefix, suffix) = match (path.file_stem(), path.extension()) {
(Some(prefix), Some(suffix)) => (prefix, suffix.to_os_string()),
(Some(name), None) if name.as_encoded_bytes().first() == Some(&b'.') => {
let mut fake_name = std::path::PathBuf::from("x");
fake_name.as_mut_os_string().push(name);
(OsStr::new(""), fake_name.extension().unwrap().to_os_string())
}
(Some(name), None) =>
(name, OsString::from("csv")),
(None, _) => {
// No file name, which means the last path component is either "." or ".."
if path.as_os_str().as_encoded_bytes().ends_with(b"..") {
(OsStr::new("."), OsString::from(""))
} else {
(OsStr::new(""), OsString::from(""))
}
}
};
- Coming back to point (2), if the user specifies
output file=out.d/file
CmdStan tries to write toout_1.d/file
etc but if the directoryout_1.d
does not exist CmdnStan won't create it and the output just silently disappears. That can't be right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order:
- I completely agree. It should be moved to a single function.
- I agree again. I found this when I was writing tests for the Rust code. However, if CmdStan does it, it seems fine to support it if it is easy enough. Splitting WTF-8 encoded bytes on ASCII is fundamentally sound, so the
unsafe
-ness doesn't bother me. In fact,std
uses the same approach; I had not looked atstd
until now. - It is only the conversion back to
OsStr
that isunsafe
, but I agree that a safe API would be nice. Unfortunately, literal split on the first period from the right is very hard to avoid. Your approach is clever, but is defeated by inputs such as below:
function bernoulli() {
./bernoulli data file=bernoulli.data.json sample num_chains=3 output file=$1
}
mkdir -p foo/bar
# files created by CmdStan
bernoulli 'foo/.bar' # "foo/_1.bar", "foo/_2.bar", "foo/_3.bar"
bernoulli 'foo/bar/' # "foo/bar/_1.csv", "foo/bar/_2.csv", "foo/bar/_3.csv"
bernoulli 'foo/bar/.' # "foo/bar/_1.", "foo/bar/_2.", "foo/bar/_3."
bernoulli 'foo/bar/..' # "foo/bar/._2.", "foo/bar/._3.", "foo/bar/._4."
bernoulli 'foo/bar/...' # "foo/bar/.._2.", "foo/bar/.._3.", "foo/bar/.._4."
One might think that path::parent
and path::file_name
would reduce the problem to the code you gave, substituting the result of path::file_name
, but that does not work due to path normalization. For example, foo/bar/.
is equivalent to foo/bar
when treated as a path; likewise for foo/bar/
.
From a code complexity standpoint, splitting at the '.'
byte is much simpler, as it does not require addition of more case logic, and matches what CmdStan does by default.
This is a good example of where unsafe
is both easier to maintain and verify the correctness of. Using the std::path
API quickly becomes cumbersome, as it performs the normalization of path components one would naturally expect; in essence, the logic of the program must now test for all the ways that a string with strange placements of '.'
would be subject to path normalization, then avoid them.
- This was sufficiently interesting that I wrote it as an issue for the CmdStan repository.
Summary
Until CmdStan switches from splitting on the last dot, there is no reason to deviate from said behavior in the Rust treatment of paths. Path normalization is not trivial (see cppreference for a nice recapitulation of the steps). Effort spent on attempting to reproduce the CmdStan behavior via a convoluted use of Path
and OsStr
, on the basis that no code should be unsafe
, is likely to lead to logical errors-- I am not keen on working out all pathological (no pun intended) behavior induced by lack of path normalization and splitting on the last '.'
. It is much simpler and maintainable to verify that the invariants of an unsafe
function are satisfied than to create a Rube Goldberg machine.
Examples like Case 2 demonstrate that such a machine would not be feasible, as it is not sufficient to split parent and child, check the respective start/end, and be done. One must scan the entire string for dots and this is only possible after as_bytes_encoded
, hence, we are back to unsafe
to put the halves back together (i.e. right back to the original proposal).
Furthermore, the discrepancy between CmdStan's behavior and the proposed interface would be a a potential source of confusion. If someone wants to use wacky file names and the program (CmdStan) accepts them, we would be remiss to impose artificial limitations.
The objective is for `CmdStanOutput` to be a self-contained record | ||
which includes all pertinent information from a successful executable | ||
call. This structure can then be used to direct calls of | ||
`diagnose`/`stansummary`. Naturally, methods on said type will be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you supply a complete list of signatures for methods on CmdStanOutput
? You mention this path helper (very nice to provide, somewhat tricky to get right!) and stansummary
, but what would their signatures look like, and are there more? I'm sure some of this will need to be discovered during implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public signatures are (at time of writing)
/// Return the output files associated with the `CmdStan` call.
pub fn output_files(&self) -> Vec<PathBuf>;
/// Return the diagnostic files associated with the `CmdStan` call.
pub fn diagnostic_files(&self) -> Vec<PathBuf>;
/// Return the profile files associated with the `CmdStan` call.
pub fn profile_files(&self) -> Vec<PathBuf>;
/// Return an immutable reference to console output associated
/// with the `CmdStan` call.
pub fn output(&self) -> &process::Output;
/// Return a reference to the current working directory at the
/// time of the call.
pub fn cwd_at_call(&self) -> &Path;
/// Return a reference to the argument tree with which the call
/// was made.
pub fn argument_tree(&self) -> &ArgumentTree;
// EDIT: on second thought, no. The next 3 methods introduce an unnecessary
// coupling of state.
// pub fn cmdstan(&self) -> &Path
// pub fn diagnose(&self) -> io::Result<process::Output>;
// pub fn stansummary(&self, opts: Option<StanSummaryOptions>) -> io::Result<process::Output>
/// Write the console output to a file. The path at which the
/// file will be created is determined in the following manner: If
/// `file` is `None`, `"log.txt"` is adjoined on to `cwd_at_call`.
/// If `file` is `Some(path)` and `path` is a relative path, then
/// `path` is adjoined on to `cwd_at_call`; otherwise, `path` is
/// assumed to be an absolute path and is used without
/// modification. Upon successful write, the path at which the
/// file was created is returned.
pub fn write_output<P: AsRef<Path>>(&self, file: Option<P>) -> io::Result<PathBuf>
I don't know if I would keep write_output
.
The path helper is actually pretty easy to figure out if you have the argument tree and cache the path of the working directory at the point of call. If the user had any relative paths, then said paths must have been relative to the directory at which they made the call (else the call fails and we don't make it this far); if they used absolute paths, then everything is sensible.
edit: on reflection, I should not blithely underestimate the ability of users to create valid-but-spaghetti paths.
|
||
Methods (receiver: `CmdStanModel` instance) exposed to the user will | ||
include: | ||
- `validate_cmdstan` : determine whether the CmdStan installation works |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both this and passing the path to the CmdStan installation feel like they're concerns separate from a CmdStanModel instance to some extent.
Does it make sense to have a CmdStanInstallation
struct which stores a path and validation logic? I'm not super familiar with what is idiomatic, but could there even be a lazy_once global which is used by default, and an overridden constructor if the user needs to supply their own or two separate instances in the same program?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. The path to the cmdstan installation going through CmdStanModel
is definitely an undesirable thing. Separating into something like
#[derive(Clone)]
pub struct CmdStanInstallation { path: PathBuf }
pub struct StanProgram { path: PathBuf }
impl CmdStanInstallation {
fn validate(&self) -> io::Result<bool>; // you can break the install by deleting it
pub fn compile(&self, prog: &StanProgram) -> Result<CmdStanModel, CompilationError>;
pub fn try_from_env() -> io::Result<Self>; // we look for `CMDSTAN` in environment
}
impl TryFrom<&Path> for CmdStanInstallation { /* validate CmdStan works, else error */ }
Using an environment variable is a reasonably consistent way to set a global without committing to a lazily-initialized global in Rust. Admittedly, this just moves the problem elsewhere, but it is clear what you would be getting -- the current CmdStan, as determined by a specific environment variable.
I am not fond of user-facing globals (unless they truly are just static data, e.g. color schemes). If the global depends on something a user is to provide, it gets messy. If we provide an interface that requires a validated CmdStan installation, then it's up to the user to figure out how to meet said requirement. I don't see much harm placing the complexity on the user -- after all, it becomes our problem if it is ambiguous as to which CmdStan should be used at any given point.
If a user really wanted to do it via global, the code below is an option.
static CMDSTAN: once_cell::sync::Lazy<PathBuf> = once_cell::sync::Lazy::new(|| infallible_go_get_cmdstan());
fn my_cmdstan_install() -> CmdStanInstallation {
let path: &Path = &*CMDSTAN;
CmdStanInstallation::try_from(path).unwrap()
}
Kind of adjacent to this proposal, for a compiled language would something like a foreign function interface scheme be more appropriate? Using something like ffistan the rust interface would not need to compile a separate binary |
Yeah, wrapping ffistan in rust is on my to-do list after I finish fleshing out the julia/python/R... but I can also imagine situations where someone would prefer a cmdstan wrapper even if they had an in-memory approach available. The simplest such example is one where you need to store a large number of draws for a large number of parameters and the total size is too big to store in memory |
@WardBrian Does ffistan not use Stan's writer class? Then folks could just use the writer to write to disk. If it only takes in a pointer to an array to write to I think you could still use something like |
@SteveBronder You could do it with mmap |
FFI would be nice, but, from inspection of The proposed argument tree abstraction would provide a convenient way to direct high-level calls, analogous to the |
I think there could definitely be code-reuse between something in Rust that shells out to cmdstan and something that does it through CFFI. From a UX standpoint you still have the same issue with having a lot of arguments and no good way of doing defaults/named arguments except through a builder-like pattern. |
N.B. no direct interface with C++!
This is a proposal for a CmdStan interface for the Rust programming language, analogous to the CmdStanPy et al. interfaces. There is one distinction, which some may consider a defect (critical input actively solicited): a strongly-typed representation of the CmdStan syntax tree.
link to rendered proposal: https://github.com/andrewjradcliffe/design-docs/blob/cmdstan-rs/designs/0035-cmdstanrs.md
Edit: add link