Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust interface for CmdStan #51

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

andrewjradcliffe
Copy link

@andrewjradcliffe andrewjradcliffe commented Nov 23, 2023

N.B. no direct interface with C++!

This is a proposal for a CmdStan interface for the Rust programming language, analogous to the CmdStanPy et al. interfaces. There is one distinction, which some may consider a defect (critical input actively solicited): a strongly-typed representation of the CmdStan syntax tree.

link to rendered proposal: https://github.com/andrewjradcliffe/design-docs/blob/cmdstan-rs/designs/0035-cmdstanrs.md

Edit: add link

Copy link
Member

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for putting this together. I left some thoughts/comments on individual points below, but I think the overall structure laid out here is a good direction for a rust interface to CmdStan. In particular, I at least am sold on the approach to the argument structure

Suppose that you write Rust code. Suppose that you use `Stan` for
probabilistic programming. You have three choices for creating an
application which utilizes both: shell scripts, introduce a scripting
language (e.g. Python) dependency, or a Rust interface for `CmdStan`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few more exotic options as well:

  • A (hypothetical) Rust interface to the stan/services APIs
  • Using Rust implementations of algorithms like NUTS and bridgestan, as is done in e.g. nutpie

These may be worth mentioning in the Rationale and Alternatives, as both have their own downsides as well.

Comment on lines +66 to +69
A `CmdStanModel` type will serve as an abstraction for a Stan program,
which may need to be compiled. Rather than compile on construction, a
compilation method must be explicitly called in user code (assuming
that a satisfactory executable does not exist yet).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This model is something CmdStanPy is moving away from, because a CmdStanModel is pretty useless without an executable.

In a strongly typed context, one could imagine that this type could instead be called something like StanProgram, and the result of calling compile is a CmdStanModel which is always associated with a executable on disk - this is sometimes called the "typestate" pattern

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent idea. As is often the case, putting a name to a common idiom (the typestate pattern) is illuminating.

appropriate term from each node, building up a linear argument list.
I sketched
[this](https://github.com/andrewjradcliffe/cmdstan-translator/blob/main/translate.scm)
out in Scheme, why I am not sure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this section is necessary, providing an example of use and/or a list of expected structs may be more useful

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- this is at best a distraction.

[num-complex](https://github.com/rust-num/num-complex) are reasonable
choices, but nonetheless represent decisions to be made!

From a design perspective, this is a great place to defer to the user,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ease of implementation this is true, but the list of above subtleties (to me) make a loud argument in the opposite direction: this is something hard, and the obvious approach will easily lead to incorrect results, so it should be handled by the library (eventually?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, yes, it should be handled by the library. It could be gated by a feature, so as to delay the need for an implementation for the initial version. As additional dependencies would likely be introduced, a feature gate is probably appropriate anyway.

Comment on lines +235 to +237
Procedural macros, applied to a Stan program stored in a string
literal in a Rust program, could be used to generate types and a
parser for Stan CSVs produced said program. However, in order to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a middle-ground between this (very cool, but tricky) idea and doing nothing, which is providing some primitives (which may themselves be macros) that allow the user to provide the information required about types and names of variables. Worth considering, and it seems like any future automation could build on it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gate here is: decide on conventions for representation of Stan types as Rust types. nalgebra is quite similar to Eigen, including memory layout, so matrices are covered.

For arrays of dimension greater than 2, there is a problem.

  • Yes, there is ndarray, but its future is uncertain.
  • Nested Vec<Vec<Vec ... >>> naturally works, but I doubt users would like it (no syntactic sugar for indexing).

If we take the latter option (the only sensible course for stable development), then it would be relatively straightforward to use a procedural macro to derive a parser for a user-provided struct where the field names and respective types must match the Stan CSV output. I do something similar for my use cases.

- At minimum, this will move a variety of run-time errors to compile
time; it might even help users to understand the methods and options
CmdStan provides.
- This enables re-use of a parameterized argument tree -- one could
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[comment] This is a very nice plus, IMO.

Comment on lines +289 to +290
- a string written to a log file
- a string which is consistent with the grammar that CmdStan accepts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also like to support reading from JSON; that should be relatively easy.

I got interested in PEGs, went a little overboard, and now there's a parser which is consistent with CmdStan; it supports a variety of input formats.

data file=bernoulli.data.json
```

The proposal is to use [pest](https://github.com/pest-parser/pest),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I am curious about is total dependencies (it may be worth calling this out as its own list/table somewhere).

Both Python and R's wrappers at least strive to be minimal, within what their standard libraries provide. I think the wonder of cargo making dependencies on the whole less painful has a downside of them tending to proliferate.

Could items like this be hidden behind a cargo feature? Continuing the only-pay-for-what-you-want mentality from earlier

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things like MSRV may also be worth mentioning!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not for this function, the MSRV would be quite a bit older.

Assuming that we put the parser behind a feature gate, the only remaining dependency is thiserror, which could be eliminated by writing out all the fmt::Display implementations (probably a couple hundred LOC).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any objections to thiserror - it seems quite common in practice and rather small in terms of scope.

You bring up an interesting other point: I (sort of implicitly) expected that this wrapper could be written using entirely safe Rust. Is that unsafe function really needed, or is it called as an optimization?

What other usages of unsafe would you anticipate, and why?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The single unsafe function arose from the need to split the string on the occurrence of '.', with the objective of replicating CmdStan's interesting treatment of paths. This could easily be eliminated if we weren't dealing with Os-specific details (hidden behind WTF-8 in OsString).

CmdStan basically just splits at the first occurrence of '.', starting from the right.
This becomes a problem when the only thing to the left of the first '.' is the empty string and there the input matches any of the forms below.

./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file='..'
# produces as output file names: `._2.`, `._3.`, `._4.`

./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file=.foo
# produces as output file names: `_2.foo`, `_3.foo`, `_4.foo`

# this is actually a repeat of the above, with "foo" as the empty string
./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file='.'
# produces as output file names: `_2.`, `_3.`, `_4.`

It is not possible to easily treat these by two edge cases due to the definition of an extension.
I might be able to come up with another way to do it, but it will complicate the code accordingly.

I see no issue with unsafe which is limited to a very specific use case that is fundamentally sound. This use is particularly easy to verify soundness: the &[u8] slices are created by as_encoded_bytes and promptly consumed; when the byte slices are re-joined, it is impossible for either to have become invalid UTF-8.

It is worth noting that OsStr::from_encoded_bytes_unchecked only appears in two places; both appearances involve inputs which are created within the same function body, that is, no foreign inputs, no intermediates escape the scope in which they are created and said intermediates are consumed in the same function body. Incidentally, it is more efficient to play with byte slices, but that was not my objective.

Each use looks like:

let bytes = path.as_encoded_bytes();
let mut iter = bytes.rsplitn(2, |b| *b == b'.');
let (prefix, suffix) = match (iter.next(), iter.next()) {
    (Some(suffix), Some(prefix)) => {
        // SAFETY:
        // - each fragment only contains content that originated
        //   from `OsStr::as_encoded_bytes`.
        // - split with ASCII period, which is a non-empty UTF-8
        //   substring
        unsafe {
            (
                OsStr::from_encoded_bytes_unchecked(prefix),
                OsStr::from_encoded_bytes_unchecked(suffix),
            )
        }
    }
    _ => (path, "csv".as_ref()),
};

I general, my philosophy on use of unsafe is that it is best to avoid where possible; for a crate like this, I certainly wouldn't expose any unsafe public APIs. There's no reason to expose an unsafe internal API for this crate, as there is nothing that warrants it here.

I realize that my response above could actually be read as somewhat alarming -- potentially implies lots of unsafe. Limited to two locations (1, 2) with no plans to expand beyond these, I hope that the above is reassuring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also sounds like something we will probably change (stan-dev/cmdstan#1213) - the current system is mostly designed-by-accident

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Your use of unsafe seem fine to me. However, since essentially the same unsafe code is used in two places I'd encapsulate it in a single safe-to-call helper function.
  2. This file extension handling corner case feels like a CmdStan bug and I wouldn't bother supporting it beyond a clear error message.
  3. It is quite frustrating that as_/from_encoded_bytes does not provide a safe API for splitting OsStrs. Nevertheless, in this case unsafe can be avoided with a couple of allocations:
use std::ffi::{OsString, OsStr};
let (prefix, suffix) = match (path.file_stem(), path.extension()) {
    (Some(prefix), Some(suffix)) => (prefix, suffix.to_os_string()),
    (Some(name), None) if name.as_encoded_bytes().first() == Some(&b'.') => {
        let mut fake_name = std::path::PathBuf::from("x");
        fake_name.as_mut_os_string().push(name);
        (OsStr::new(""), fake_name.extension().unwrap().to_os_string())
    }
    (Some(name), None) =>
        (name, OsString::from("csv")),
    (None, _) => {
        // No file name, which means the last path component is either "." or ".."
        if path.as_os_str().as_encoded_bytes().ends_with(b"..") {
            (OsStr::new("."), OsString::from(""))
        } else {
            (OsStr::new(""), OsString::from(""))
        }
    }
};
  1. Coming back to point (2), if the user specifies output file=out.d/file CmdStan tries to write to out_1.d/file etc but if the directory out_1.d does not exist CmdnStan won't create it and the output just silently disappears. That can't be right.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order:

  1. I completely agree. It should be moved to a single function.
  2. I agree again. I found this when I was writing tests for the Rust code. However, if CmdStan does it, it seems fine to support it if it is easy enough. Splitting WTF-8 encoded bytes on ASCII is fundamentally sound, so the unsafe-ness doesn't bother me. In fact, std uses the same approach; I had not looked at std until now.
  3. It is only the conversion back to OsStr that is unsafe, but I agree that a safe API would be nice. Unfortunately, literal split on the first period from the right is very hard to avoid. Your approach is clever, but is defeated by inputs such as below:
function bernoulli() {
    ./bernoulli data file=bernoulli.data.json sample num_chains=3 output file=$1
}

mkdir -p foo/bar
                            # files created by CmdStan
bernoulli 'foo/.bar'        # "foo/_1.bar", "foo/_2.bar", "foo/_3.bar"
bernoulli 'foo/bar/'        # "foo/bar/_1.csv", "foo/bar/_2.csv", "foo/bar/_3.csv"
bernoulli 'foo/bar/.'       # "foo/bar/_1.", "foo/bar/_2.", "foo/bar/_3."
bernoulli 'foo/bar/..'      # "foo/bar/._2.", "foo/bar/._3.", "foo/bar/._4."
bernoulli 'foo/bar/...'     # "foo/bar/.._2.", "foo/bar/.._3.", "foo/bar/.._4."

One might think that path::parent and path::file_name would reduce the problem to the code you gave, substituting the result of path::file_name, but that does not work due to path normalization. For example, foo/bar/. is equivalent to foo/bar when treated as a path; likewise for foo/bar/.

From a code complexity standpoint, splitting at the '.' byte is much simpler, as it does not require addition of more case logic, and matches what CmdStan does by default.

This is a good example of where unsafe is both easier to maintain and verify the correctness of. Using the std::path API quickly becomes cumbersome, as it performs the normalization of path components one would naturally expect; in essence, the logic of the program must now test for all the ways that a string with strange placements of '.' would be subject to path normalization, then avoid them.

  1. This was sufficiently interesting that I wrote it as an issue for the CmdStan repository.

Summary

Until CmdStan switches from splitting on the last dot, there is no reason to deviate from said behavior in the Rust treatment of paths. Path normalization is not trivial (see cppreference for a nice recapitulation of the steps). Effort spent on attempting to reproduce the CmdStan behavior via a convoluted use of Path and OsStr, on the basis that no code should be unsafe, is likely to lead to logical errors-- I am not keen on working out all pathological (no pun intended) behavior induced by lack of path normalization and splitting on the last '.'. It is much simpler and maintainable to verify that the invariants of an unsafe function are satisfied than to create a Rube Goldberg machine.

Examples like Case 2 demonstrate that such a machine would not be feasible, as it is not sufficient to split parent and child, check the respective start/end, and be done. One must scan the entire string for dots and this is only possible after as_bytes_encoded, hence, we are back to unsafe to put the halves back together (i.e. right back to the original proposal).

Furthermore, the discrepancy between CmdStan's behavior and the proposed interface would be a a potential source of confusion. If someone wants to use wacky file names and the program (CmdStan) accepts them, we would be remiss to impose artificial limitations.

The objective is for `CmdStanOutput` to be a self-contained record
which includes all pertinent information from a successful executable
call. This structure can then be used to direct calls of
`diagnose`/`stansummary`. Naturally, methods on said type will be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you supply a complete list of signatures for methods on CmdStanOutput? You mention this path helper (very nice to provide, somewhat tricky to get right!) and stansummary, but what would their signatures look like, and are there more? I'm sure some of this will need to be discovered during implementation

Copy link
Author

@andrewjradcliffe andrewjradcliffe Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public signatures are (at time of writing)

/// Return the output files associated with the `CmdStan` call.
pub fn output_files(&self) -> Vec<PathBuf>;

/// Return the diagnostic files associated with the `CmdStan` call.
pub fn diagnostic_files(&self) -> Vec<PathBuf>;

/// Return the profile files associated with the `CmdStan` call.
pub fn profile_files(&self) -> Vec<PathBuf>;

/// Return an immutable reference to console output associated
/// with the `CmdStan` call.
pub fn output(&self) -> &process::Output;

/// Return a reference to the current working directory at the
/// time of the call.
pub fn cwd_at_call(&self) -> &Path;

/// Return a reference to the argument tree with which the call
/// was made.
pub fn argument_tree(&self) -> &ArgumentTree;

// EDIT: on second thought, no. The next 3 methods introduce an unnecessary
// coupling of state. 
// pub fn cmdstan(&self) -> &Path
// pub fn diagnose(&self) -> io::Result<process::Output>;
// pub fn stansummary(&self, opts: Option<StanSummaryOptions>) -> io::Result<process::Output>

/// Write the console output to a file.  The path at which the
/// file will be created is determined in the following manner: If
/// `file` is `None`, `"log.txt"` is adjoined on to `cwd_at_call`.
/// If `file` is `Some(path)` and `path` is a relative path, then
/// `path` is adjoined on to `cwd_at_call`; otherwise, `path` is
/// assumed to be an absolute path and is used without
/// modification. Upon successful write, the path at which the
/// file was created is returned.
pub fn write_output<P: AsRef<Path>>(&self, file: Option<P>) -> io::Result<PathBuf>

I don't know if I would keep write_output.

The path helper is actually pretty easy to figure out if you have the argument tree and cache the path of the working directory at the point of call. If the user had any relative paths, then said paths must have been relative to the directory at which they made the call (else the call fails and we don't make it this far); if they used absolute paths, then everything is sensible.

edit: on reflection, I should not blithely underestimate the ability of users to create valid-but-spaghetti paths.


Methods (receiver: `CmdStanModel` instance) exposed to the user will
include:
- `validate_cmdstan` : determine whether the CmdStan installation works
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both this and passing the path to the CmdStan installation feel like they're concerns separate from a CmdStanModel instance to some extent.

Does it make sense to have a CmdStanInstallation struct which stores a path and validation logic? I'm not super familiar with what is idiomatic, but could there even be a lazy_once global which is used by default, and an overridden constructor if the user needs to supply their own or two separate instances in the same program?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. The path to the cmdstan installation going through CmdStanModel is definitely an undesirable thing. Separating into something like

#[derive(Clone)]
pub struct CmdStanInstallation { path: PathBuf }
pub struct StanProgram { path: PathBuf }

impl CmdStanInstallation {
    fn validate(&self) -> io::Result<bool>; // you can break the install by deleting it
    pub fn compile(&self, prog: &StanProgram) -> Result<CmdStanModel, CompilationError>;
    pub fn try_from_env() -> io::Result<Self>; // we look for `CMDSTAN` in environment
}

impl TryFrom<&Path> for CmdStanInstallation { /* validate CmdStan works, else error */ }

Using an environment variable is a reasonably consistent way to set a global without committing to a lazily-initialized global in Rust. Admittedly, this just moves the problem elsewhere, but it is clear what you would be getting -- the current CmdStan, as determined by a specific environment variable.

I am not fond of user-facing globals (unless they truly are just static data, e.g. color schemes). If the global depends on something a user is to provide, it gets messy. If we provide an interface that requires a validated CmdStan installation, then it's up to the user to figure out how to meet said requirement. I don't see much harm placing the complexity on the user -- after all, it becomes our problem if it is ambiguous as to which CmdStan should be used at any given point.

If a user really wanted to do it via global, the code below is an option.

static CMDSTAN: once_cell::sync::Lazy<PathBuf> = once_cell::sync::Lazy::new(|| infallible_go_get_cmdstan());

fn my_cmdstan_install() -> CmdStanInstallation {
    let path: &Path = &*CMDSTAN;
    CmdStanInstallation::try_from(path).unwrap()
}

@SteveBronder
Copy link
Collaborator

Kind of adjacent to this proposal, for a compiled language would something like a foreign function interface scheme be more appropriate? Using something like ffistan the rust interface would not need to compile a separate binary

@WardBrian
Copy link
Member

Yeah, wrapping ffistan in rust is on my to-do list after I finish fleshing out the julia/python/R... but I can also imagine situations where someone would prefer a cmdstan wrapper even if they had an in-memory approach available. The simplest such example is one where you need to store a large number of draws for a large number of parameters and the total size is too big to store in memory

@SteveBronder
Copy link
Collaborator

@WardBrian Does ffistan not use Stan's writer class? Then folks could just use the writer to write to disk. If it only takes in a pointer to an array to write to I think you could still use something like mmap on linux to write to disk

@WardBrian
Copy link
Member

@SteveBronder You could do it with mmap

@andrewjradcliffe
Copy link
Author

FFI would be nice, but, from inspection of ffistan there is still a call to make in order to build the .so. The resultant calls are through through the C ABI, which avoids the rigmarole of using a separate binary, spawning a process, communication, etc.

The proposed argument tree abstraction would provide a convenient way to direct high-level calls, analogous to the sample, optimize and pathfinder methods (Python) / functions (Julia) I see for the clients.

@WardBrian
Copy link
Member

I think there could definitely be code-reuse between something in Rust that shells out to cmdstan and something that does it through CFFI. From a UX standpoint you still have the same issue with having a lot of arguments and no good way of doing defaults/named arguments except through a builder-like pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants