Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust interface for CmdStan #51

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
348 changes: 348 additions & 0 deletions designs/0035-cmdstanrs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,348 @@
- Feature Name: cmdstanrs
- Start Date: 2023-11-22
- RFC PR:
- Stan Issue:

# Summary
[summary]: #summary

This is a proposal for a Rust interface for CmdStan through compiled
executables, that is, no direct interface with C++.

The goal is to provide an interface which enables users to:
- compile Stan programs (with arbitrary options)
- build and compose arguments/options (to be passed to C++
executables) in idiomatic Rust
- call C++ executables, then memoize input and collect output (thereby
making these available for programmatic use)
- call `diagnose` and `stansummary` tools and collect output

The objective is to keep the interface as simple as possible.

# Motivation
[motivation]: #motivation

Suppose that you write Rust code. Suppose that you use `Stan` for
probabilistic programming. You have three choices for creating an
application which utilizes both: shell scripts, introduce a scripting
language (e.g. Python) dependency, or a Rust interface for `CmdStan`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few more exotic options as well:

  • A (hypothetical) Rust interface to the stan/services APIs
  • Using Rust implementations of algorithms like NUTS and bridgestan, as is done in e.g. nutpie

These may be worth mentioning in the Rationale and Alternatives, as both have their own downsides as well.


Orchestration using the shell suffers from portability issues and
often leads unnecessary fragmentation of control flow. Introducing a
scripting language may confer portability, but control flow and error
handling are now divided between two languages; furthermore, this
necessitates code be written to serialize/deserialize intermediates.

A Rust interface, in similar spirit to the CmdStan interfaces from
other languages, would provide the necessary abstraction to eliminate
the aforementioned problems.

# Functional Specification

Given a Stan program, a user of the library will compile the model (if
desired), call the executable with arguments (translated from
strongly-typed argument tree), and obtain a self-contained context
which encapsulates the pertinent information from the call.

### Assumptions

We assume (at our peril) that Rust programmers that will be able to
figure out how to satisfy the following requirement:
- a working CmdStan installation exists at some user-accessible path

### Processes, IO

The proposal is to use the Rust `std` library, in particular the
[process](https://doc.rust-lang.org/std/process/index.html),
[path](https://doc.rust-lang.org/std/path/index.html),
[fs](https://doc.rust-lang.org/std/fs/index.html), and
[ffi](https://doc.rust-lang.org/std/ffi/index.html) modules, to
orchestrate processes, interact with file system and handle
cross-platform concerns. This will yield a library which is portable,
provided that it is (cross-)compiled for the intended target.

## Control: compilation and calling the resultant executable

A `CmdStanModel` type will serve as an abstraction for a Stan program,
which may need to be compiled. Rather than compile on construction, a
compilation method must be explicitly called in user code (assuming
that a satisfactory executable does not exist yet).
Comment on lines +66 to +69
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This model is something CmdStanPy is moving away from, because a CmdStanModel is pretty useless without an executable.

In a strongly typed context, one could imagine that this type could instead be called something like StanProgram, and the result of calling compile is a CmdStanModel which is always associated with a executable on disk - this is sometimes called the "typestate" pattern

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent idea. As is often the case, putting a name to a common idiom (the typestate pattern) is illuminating.


Two arguments will be necessary to create a `CmdStanModel`:
1. a path to a CmdStan installation
2. a path to a Stan program

Methods (receiver: `CmdStanModel` instance) exposed to the user will
include:
- `validate_cmdstan` : determine whether the CmdStan installation works
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both this and passing the path to the CmdStan installation feel like they're concerns separate from a CmdStanModel instance to some extent.

Does it make sense to have a CmdStanInstallation struct which stores a path and validation logic? I'm not super familiar with what is idiomatic, but could there even be a lazy_once global which is used by default, and an overridden constructor if the user needs to supply their own or two separate instances in the same program?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. The path to the cmdstan installation going through CmdStanModel is definitely an undesirable thing. Separating into something like

#[derive(Clone)]
pub struct CmdStanInstallation { path: PathBuf }
pub struct StanProgram { path: PathBuf }

impl CmdStanInstallation {
    fn validate(&self) -> io::Result<bool>; // you can break the install by deleting it
    pub fn compile(&self, prog: &StanProgram) -> Result<CmdStanModel, CompilationError>;
    pub fn try_from_env() -> io::Result<Self>; // we look for `CMDSTAN` in environment
}

impl TryFrom<&Path> for CmdStanInstallation { /* validate CmdStan works, else error */ }

Using an environment variable is a reasonably consistent way to set a global without committing to a lazily-initialized global in Rust. Admittedly, this just moves the problem elsewhere, but it is clear what you would be getting -- the current CmdStan, as determined by a specific environment variable.

I am not fond of user-facing globals (unless they truly are just static data, e.g. color schemes). If the global depends on something a user is to provide, it gets messy. If we provide an interface that requires a validated CmdStan installation, then it's up to the user to figure out how to meet said requirement. I don't see much harm placing the complexity on the user -- after all, it becomes our problem if it is ambiguous as to which CmdStan should be used at any given point.

If a user really wanted to do it via global, the code below is an option.

static CMDSTAN: once_cell::sync::Lazy<PathBuf> = once_cell::sync::Lazy::new(|| infallible_go_get_cmdstan());

fn my_cmdstan_install() -> CmdStanInstallation {
    let path: &Path = &*CMDSTAN;
    CmdStanInstallation::try_from(path).unwrap()
}

- `executable_works` : is there a working executable at the path
implied by the Stan file?
- `compile_with_args` : attempt to compile a Stan program with
optional `make` arguments
- `call_executable` : call executable with the given argument tree; on
success, return a `CmdStanOutput` instance.

## Output

Output of a successful `call_executable` call on a `CmdStanModel` will
produce a `CmdStanOuput` instance, which encapsulates the context of
the call. This includes:
- the console output (exit status, stdout and stderr),
- the argument tree provided to `call_executable`
- the current working directory of the process at the time the call was made
- the CmdStan installation from the parent `CmdStanModel`

The objective is for `CmdStanOutput` to be a self-contained record
which includes all pertinent information from a successful executable
call. This structure can then be used to direct calls of
`diagnose`/`stansummary`. Naturally, methods on said type will be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you supply a complete list of signatures for methods on CmdStanOutput? You mention this path helper (very nice to provide, somewhat tricky to get right!) and stansummary, but what would their signatures look like, and are there more? I'm sure some of this will need to be discovered during implementation

Copy link
Author

@andrewjradcliffe andrewjradcliffe Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public signatures are (at time of writing)

/// Return the output files associated with the `CmdStan` call.
pub fn output_files(&self) -> Vec<PathBuf>;

/// Return the diagnostic files associated with the `CmdStan` call.
pub fn diagnostic_files(&self) -> Vec<PathBuf>;

/// Return the profile files associated with the `CmdStan` call.
pub fn profile_files(&self) -> Vec<PathBuf>;

/// Return an immutable reference to console output associated
/// with the `CmdStan` call.
pub fn output(&self) -> &process::Output;

/// Return a reference to the current working directory at the
/// time of the call.
pub fn cwd_at_call(&self) -> &Path;

/// Return a reference to the argument tree with which the call
/// was made.
pub fn argument_tree(&self) -> &ArgumentTree;

// EDIT: on second thought, no. The next 3 methods introduce an unnecessary
// coupling of state. 
// pub fn cmdstan(&self) -> &Path
// pub fn diagnose(&self) -> io::Result<process::Output>;
// pub fn stansummary(&self, opts: Option<StanSummaryOptions>) -> io::Result<process::Output>

/// Write the console output to a file.  The path at which the
/// file will be created is determined in the following manner: If
/// `file` is `None`, `"log.txt"` is adjoined on to `cwd_at_call`.
/// If `file` is `Some(path)` and `path` is a relative path, then
/// `path` is adjoined on to `cwd_at_call`; otherwise, `path` is
/// assumed to be an absolute path and is used without
/// modification. Upon successful write, the path at which the
/// file was created is returned.
pub fn write_output<P: AsRef<Path>>(&self, file: Option<P>) -> io::Result<PathBuf>

I don't know if I would keep write_output.

The path helper is actually pretty easy to figure out if you have the argument tree and cache the path of the working directory at the point of call. If the user had any relative paths, then said paths must have been relative to the directory at which they made the call (else the call fails and we don't make it this far); if they used absolute paths, then everything is sensible.

edit: on reflection, I should not blithely underestimate the ability of users to create valid-but-spaghetti paths.

present to expose the context to the user and perform utility
functions (e.g. return a list of output file paths).

## Arguments and options

Stan provides several inference engines, each with a large number of
options. CmdStan in turn handles this heterogeneity.

To encapsulate the arguments passed at the command line (to a compiled
executable), the proposal is a strongly-typed representation of this
heterogeneity using a combination of sum types (Rust `enum`) and
product types (Rust `struct`). By construction, this representation
prevents the formation of inconsistent argument combinations -- the
code simply won't compile. The resultant tree is an abstraction which
enables the use of a single type (`CmdStanOutput`) to encapsulate a
call to an executable.

Unsurprisingly, the argument tree is a syntax tree for CmdStan
command arguments. We translate to the very simple command line
language, but leave open the possibility of translation to other
languages.

### Translation

The (sloppy) productions for the command line language are:
```text
tree -> terms
terms -> terms " " term | term
term -> pair | product | sum
pair -> key "=" value
product -> type " " pairs
sum -> type "=" variant " " terms | type "=" variant
pairs -> pairs " " pair | pair

key -> A
A -> A alpha | beta
alpha -> letter | digit | "_"
beta -> letter
letter -> "A" | ... | "z"
digit -> "0" | ... | "9"

value -> number | path
```
Where the productions for `number` and `path` are left out for
brevity. The start symbol is `tree`. Generate the command line
statement by folding the tree from left to right by generating the
appropriate term from each node, building up a linear argument list.
I sketched
[this](https://github.com/andrewjradcliffe/cmdstan-translator/blob/main/translate.scm)
out in Scheme, why I am not sure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this section is necessary, providing an example of use and/or a list of expected structs may be more useful

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- this is at best a distraction.


### Ergonomics

Philosophy:
- pay (in LOC) for only what you need.
- minimize differences between naming of the types and fields (see
below) in the Rust implementation and CmdStan.

The builder pattern will be implemented for each `struct`, and for
each `enum` variant (excluding unit variants). This enables the user
to supply only the arguments for which they desire non-default
values. This leads to succinct code when one needs only the defaults
([example](https://github.com/andrewjradcliffe/cmdstan-rs/blob/main/examples/bernoulli-many/main.rs)).

#### A side effect of strong typing

With [company-mode](https://github.com/company-mode/company-mode) and
[eglot-mode](https://github.com/joaotavora/eglot)
([lsp-mode](https://github.com/emacs-lsp/lsp-mode/) also works) in
Emacs 28.2, one can view options at each node in the argument tree by
code that looks something like the following:

```rust
ArgumentTree::builder(). // hover on the `.`
```

If one has a Rust language server and completion support in their
editor, this is a free side effect. Whether it will help anyone
is uncertain.

### Coverage

The objective is for the interface to cover all options which can be
passed to a compiled Stan program, that is, all methods and all
options for said methods.

## Separation of concerns

Other than the argument tree support, the interface proposed is very
simple: the user can compile a Stan program, call it with arguments,
get basic information from the output, and call
`diagnose`/`stansummary`.

Below, I provide the rationale for exclusion of two aspects. My
judgment is that they are useful, but are best developed separately.

### Serialization

It is trivial to provide a function such as
```rust
fn write_json<T: Serialize>(data: &T, file: &Path) {}
```
but it serves no purpose -- it does not enforce the conventions
adopted for representing Stan data types in JSON (e.g. matrices
represented as a vector of *row* vectors, not a vector of *column*
vectors), hence, would likely lead to unexpected (and potentially
silent!) errors.

In order to develop a serializer which respects the conventions for
Stan data types, one would need to declare conventions for the mapping
of Rust data types to Stan data
types. [serde_json](https://github.com/serde-rs/json) would be nice to
use, but has some incompatibilities (Rust tuple is represented as an
array, rather than an object).

Moreover, to represent matrices, complex numbers, etc., one would need
to support types from the crate ecosystem since the standard library
lacks these -- [nalgebra](https://github.com/dimforge/nalgebra) and
[num-complex](https://github.com/rust-num/num-complex) are reasonable
choices, but nonetheless represent decisions to be made!

From a design perspective, this is a great place to defer to the user,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ease of implementation this is true, but the list of above subtleties (to me) make a loud argument in the opposite direction: this is something hard, and the obvious approach will easily lead to incorrect results, so it should be handled by the library (eventually?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, yes, it should be handled by the library. It could be gated by a feature, so as to delay the need for an implementation for the initial version. As additional dependencies would likely be introduced, a feature gate is probably appropriate anyway.

at least for the moment. A principled approach would involve writing a
data format for [serde](https://serde.rs/).

### Deserialization

Parsing Stan CSVs to a strongly-typed representation is simple if one
wishes to simply obtain a matrix of values (or `Vec<Vec<f64>>` if we
limit ourselves to `std` library types). However, one needs to extract
the variables from each line, thus, one needs to know the types (and
their dimensions). A recursive definition of types using `enum`s and
`struct`s could probably work to represent such a thing in Rust, but
may not necessarily be particularly ergonomic (i.e. much unavoidable
boilerplate would be needed to use the resultant type).

Procedural macros, applied to a Stan program stored in a string
literal in a Rust program, could be used to generate types and a
parser for Stan CSVs produced said program. However, in order to
Comment on lines +235 to +237
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a middle-ground between this (very cool, but tricky) idea and doing nothing, which is providing some primitives (which may themselves be macros) that allow the user to provide the information required about types and names of variables. Worth considering, and it seems like any future automation could build on it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gate here is: decide on conventions for representation of Stan types as Rust types. nalgebra is quite similar to Eigen, including memory layout, so matrices are covered.

For arrays of dimension greater than 2, there is a problem.

  • Yes, there is ndarray, but its future is uncertain.
  • Nested Vec<Vec<Vec ... >>> naturally works, but I doubt users would like it (no syntactic sugar for indexing).

If we take the latter option (the only sensible course for stable development), then it would be relatively straightforward to use a procedural macro to derive a parser for a user-provided struct where the field names and respective types must match the Stan CSV output. I do something similar for my use cases.

implement such an idea, one would first need to adopt conventions for
representing Stan data types using Rust data types. This requires
careful thought and and is something best left for the future, if
ever.

The current proposal is for `CmdStanOutput` to be capable of returning
paths to the files and the user parses them however they desire.
This leaves open the possibility of multiple parsing strategies.

# Drawbacks
[drawbacks]: #drawbacks

Representing CmdStan arguments/options as a concrete syntax tree is
potentially brittle. If the CmdStan grammar undergoes radical change,
this interface will need to change accordingly. However, the CmdStan
grammar is intended to be quite stable. Moreover, it is not
necessarily the case that radical changes to the CmdStan grammar could
be hidden behind something other abstraction.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since breaking changes are discussed here, it seems like a reasonable place to ask about versioning: I assume the goal is to emulate cmdstanpy/r and be "mostly" cmdstan-version-independent? Or would a given version of this crate be specialized for a specific cmdstan version?

Version independence is nice in theory but IMO has some practical concerns which may lead one to reconsider the decisions make when working on cmdstanpy/r, especially when it comes to static checking. If cmdstan adds a new argument, supporting older versions simultaneously now requires the builder to do some runtime checks on top of what the Rust type system provided you already, which is a bummer.

Alternatively, the argument structs could themselves be versioned... but that seems like a pain.

Copy link
Author

@andrewjradcliffe andrewjradcliffe Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, quite a few (maybe all?) of the structs and enums would need to be marked with #[non_exhaustive]; I was anticipating as much with the builder pattern.

If the Rust code stays up-to-date with the latest cmdstan arguments and options, then the worst case is that an older cmdstan will return something to the effect of "... is not a valid value for ..." in the stderr. An output is already not a guarantee (the process could die for any reason), thus, rather than leaving it to the user, we could provide a course classification of the errors (insufficient cmdstan version to support the user's argument tree being a variant).

I do not see an easy way to raise cmdstan capability checks to compile-time without resorting to strange things in a build script or procedural macro antics. Even then, the cmdstan installation that a user intends to use must be available at compile time.

Alternatively, one could package cmdstan as a Rust crate and manage it as a dependency with cargo. I do not recommend this due to the other headaches it would create.

edit: add link

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable! Saying as much in the document would be appreciated


# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

Other than the direct representation of the CmdStan syntax tree, the
proposal contains nothing new. Utilizing a concrete representation of
the syntax tree does have benefits:
- all outputs handled via single type `CmdStanOutput`
- elimination of individual structures and methods for each of Stan's
inference algorithms
Comment on lines +264 to +265
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think eliminating these is probably a good idea, but maybe not one whose time has come. At the moment, the output files of e.g. ADVI should not be treated the same as those produced by sampling, because the first row is the mean of the approximation, not a draw. Therefore, something like picking a random row for initialization or even passing to stansummary is subtly different depending on algorithm.

One hopes that cmdstan will eventually have some nicer output conventions, such as putting the mean in its own file, but in the mean time the type safety of preventing you from doing something foolish could be nice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was that with the argument tree cached in the output, special handling could be implemented there (e.g. match on the method). A reference to the cached value would be available to the user, thus enabling the same match-based approach for their downstream code.

On the other hand, it would be possible to move the common functionality to a trait and have one output type for each method; this would enable generic code for T: Output, but would likely require a Box<dyn Output>.

Another enum, mirroring the method variants in the argument tree, would likely be easier to reason about (rather than a boxed trait object). But if you have to match on output variant, it's the same in effect to match on the input variant.


The question remains: is this is a good idea?

## Argument tree considerations

As described above, the grammar for CmdStan arguments passed at the
command line can be represented as a syntax tree through the use of
sum and product types.
- This enables compile-time validation of argument consistency -- the
worst that can happen is you provide a value that CmdStan does not
like (e.g. `num_threads=-20`).
- At minimum, this will move a variety of run-time errors to compile
time; it might even help users to understand the methods and options
CmdStan provides.
- This enables re-use of a parameterized argument tree -- one could
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[comment] This is a very nice plus, IMO.

replace the inference method while leaving the other options
(e.g. data files) constant. As shown in [this
example](https://github.com/andrewjradcliffe/cmdstan-rs/blob/main/examples/bernoulli-many/main.rs),
such an approach can be quite expressive.

Furthermore, representation as a concrete syntax tree enables the
possibility of interesting features. One could parse the syntax tree
from:
- a string written to a log file
- a string which is consistent with the grammar that CmdStan accepts
Comment on lines +289 to +290
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also like to support reading from JSON; that should be relatively easy.

I got interested in PEGs, went a little overboard, and now there's a parser which is consistent with CmdStan; it supports a variety of input formats.


The latter is interesting in that a user's extant command line input
is all that is required to use the Rust interface. For example,
this leads to the following syntax:

```rust
// Assuming we implemented this through the `FromStr` trait
let tree: ArgumentTree = "method=sample data file=bernoulli.data.json".parse().unwrap();
```

This would substantially lower the barrier to adoption of the Rust
interface as the user need only know what they are already doing.

Due to Rust's orphan rules, such features would need to be implemented
within this crate; they could be placed behind a feature gate to
minimize compile time. It stands to reason that if we can translate to
a string, we should be able to perform the inverse operation.

The design philosophy here would be: a valid parse is whatever CmdStan
is willing to accept. However, CmdStan accepts some weird statements.
For example:
```bash
./bernoulli method=sample adapt engaged engaged=0 engaged engaged=1 gamma engaged gamma \
data file=bernoulli.data.json
```

The proposal is to use [pest](https://github.com/pest-parser/pest),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I am curious about is total dependencies (it may be worth calling this out as its own list/table somewhere).

Both Python and R's wrappers at least strive to be minimal, within what their standard libraries provide. I think the wonder of cargo making dependencies on the whole less painful has a downside of them tending to proliferate.

Could items like this be hidden behind a cargo feature? Continuing the only-pay-for-what-you-want mentality from earlier

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things like MSRV may also be worth mentioning!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not for this function, the MSRV would be quite a bit older.

Assuming that we put the parser behind a feature gate, the only remaining dependency is thiserror, which could be eliminated by writing out all the fmt::Display implementations (probably a couple hundred LOC).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any objections to thiserror - it seems quite common in practice and rather small in terms of scope.

You bring up an interesting other point: I (sort of implicitly) expected that this wrapper could be written using entirely safe Rust. Is that unsafe function really needed, or is it called as an optimization?

What other usages of unsafe would you anticipate, and why?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The single unsafe function arose from the need to split the string on the occurrence of '.', with the objective of replicating CmdStan's interesting treatment of paths. This could easily be eliminated if we weren't dealing with Os-specific details (hidden behind WTF-8 in OsString).

CmdStan basically just splits at the first occurrence of '.', starting from the right.
This becomes a problem when the only thing to the left of the first '.' is the empty string and there the input matches any of the forms below.

./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file='..'
# produces as output file names: `._2.`, `._3.`, `._4.`

./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file=.foo
# produces as output file names: `_2.foo`, `_3.foo`, `_4.foo`

# this is actually a repeat of the above, with "foo" as the empty string
./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file='.'
# produces as output file names: `_2.`, `_3.`, `_4.`

It is not possible to easily treat these by two edge cases due to the definition of an extension.
I might be able to come up with another way to do it, but it will complicate the code accordingly.

I see no issue with unsafe which is limited to a very specific use case that is fundamentally sound. This use is particularly easy to verify soundness: the &[u8] slices are created by as_encoded_bytes and promptly consumed; when the byte slices are re-joined, it is impossible for either to have become invalid UTF-8.

It is worth noting that OsStr::from_encoded_bytes_unchecked only appears in two places; both appearances involve inputs which are created within the same function body, that is, no foreign inputs, no intermediates escape the scope in which they are created and said intermediates are consumed in the same function body. Incidentally, it is more efficient to play with byte slices, but that was not my objective.

Each use looks like:

let bytes = path.as_encoded_bytes();
let mut iter = bytes.rsplitn(2, |b| *b == b'.');
let (prefix, suffix) = match (iter.next(), iter.next()) {
    (Some(suffix), Some(prefix)) => {
        // SAFETY:
        // - each fragment only contains content that originated
        //   from `OsStr::as_encoded_bytes`.
        // - split with ASCII period, which is a non-empty UTF-8
        //   substring
        unsafe {
            (
                OsStr::from_encoded_bytes_unchecked(prefix),
                OsStr::from_encoded_bytes_unchecked(suffix),
            )
        }
    }
    _ => (path, "csv".as_ref()),
};

I general, my philosophy on use of unsafe is that it is best to avoid where possible; for a crate like this, I certainly wouldn't expose any unsafe public APIs. There's no reason to expose an unsafe internal API for this crate, as there is nothing that warrants it here.

I realize that my response above could actually be read as somewhat alarming -- potentially implies lots of unsafe. Limited to two locations (1, 2) with no plans to expand beyond these, I hope that the above is reassuring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also sounds like something we will probably change (stan-dev/cmdstan#1213) - the current system is mostly designed-by-accident

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Your use of unsafe seem fine to me. However, since essentially the same unsafe code is used in two places I'd encapsulate it in a single safe-to-call helper function.
  2. This file extension handling corner case feels like a CmdStan bug and I wouldn't bother supporting it beyond a clear error message.
  3. It is quite frustrating that as_/from_encoded_bytes does not provide a safe API for splitting OsStrs. Nevertheless, in this case unsafe can be avoided with a couple of allocations:
use std::ffi::{OsString, OsStr};
let (prefix, suffix) = match (path.file_stem(), path.extension()) {
    (Some(prefix), Some(suffix)) => (prefix, suffix.to_os_string()),
    (Some(name), None) if name.as_encoded_bytes().first() == Some(&b'.') => {
        let mut fake_name = std::path::PathBuf::from("x");
        fake_name.as_mut_os_string().push(name);
        (OsStr::new(""), fake_name.extension().unwrap().to_os_string())
    }
    (Some(name), None) =>
        (name, OsString::from("csv")),
    (None, _) => {
        // No file name, which means the last path component is either "." or ".."
        if path.as_os_str().as_encoded_bytes().ends_with(b"..") {
            (OsStr::new("."), OsString::from(""))
        } else {
            (OsStr::new(""), OsString::from(""))
        }
    }
};
  1. Coming back to point (2), if the user specifies output file=out.d/file CmdStan tries to write to out_1.d/file etc but if the directory out_1.d does not exist CmdnStan won't create it and the output just silently disappears. That can't be right.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order:

  1. I completely agree. It should be moved to a single function.
  2. I agree again. I found this when I was writing tests for the Rust code. However, if CmdStan does it, it seems fine to support it if it is easy enough. Splitting WTF-8 encoded bytes on ASCII is fundamentally sound, so the unsafe-ness doesn't bother me. In fact, std uses the same approach; I had not looked at std until now.
  3. It is only the conversion back to OsStr that is unsafe, but I agree that a safe API would be nice. Unfortunately, literal split on the first period from the right is very hard to avoid. Your approach is clever, but is defeated by inputs such as below:
function bernoulli() {
    ./bernoulli data file=bernoulli.data.json sample num_chains=3 output file=$1
}

mkdir -p foo/bar
                            # files created by CmdStan
bernoulli 'foo/.bar'        # "foo/_1.bar", "foo/_2.bar", "foo/_3.bar"
bernoulli 'foo/bar/'        # "foo/bar/_1.csv", "foo/bar/_2.csv", "foo/bar/_3.csv"
bernoulli 'foo/bar/.'       # "foo/bar/_1.", "foo/bar/_2.", "foo/bar/_3."
bernoulli 'foo/bar/..'      # "foo/bar/._2.", "foo/bar/._3.", "foo/bar/._4."
bernoulli 'foo/bar/...'     # "foo/bar/.._2.", "foo/bar/.._3.", "foo/bar/.._4."

One might think that path::parent and path::file_name would reduce the problem to the code you gave, substituting the result of path::file_name, but that does not work due to path normalization. For example, foo/bar/. is equivalent to foo/bar when treated as a path; likewise for foo/bar/.

From a code complexity standpoint, splitting at the '.' byte is much simpler, as it does not require addition of more case logic, and matches what CmdStan does by default.

This is a good example of where unsafe is both easier to maintain and verify the correctness of. Using the std::path API quickly becomes cumbersome, as it performs the normalization of path components one would naturally expect; in essence, the logic of the program must now test for all the ways that a string with strange placements of '.' would be subject to path normalization, then avoid them.

  1. This was sufficiently interesting that I wrote it as an issue for the CmdStan repository.

Summary

Until CmdStan switches from splitting on the last dot, there is no reason to deviate from said behavior in the Rust treatment of paths. Path normalization is not trivial (see cppreference for a nice recapitulation of the steps). Effort spent on attempting to reproduce the CmdStan behavior via a convoluted use of Path and OsStr, on the basis that no code should be unsafe, is likely to lead to logical errors-- I am not keen on working out all pathological (no pun intended) behavior induced by lack of path normalization and splitting on the last '.'. It is much simpler and maintainable to verify that the invariants of an unsafe function are satisfied than to create a Rube Goldberg machine.

Examples like Case 2 demonstrate that such a machine would not be feasible, as it is not sufficient to split parent and child, check the respective start/end, and be done. One must scan the entire string for dots and this is only possible after as_bytes_encoded, hence, we are back to unsafe to put the halves back together (i.e. right back to the original proposal).

Furthermore, the discrepancy between CmdStan's behavior and the proposed interface would be a a potential source of confusion. If someone wants to use wacky file names and the program (CmdStan) accepts them, we would be remiss to impose artificial limitations.

rather than write a custom parser.

# Prior art
[prior-art]: #prior-art

I have used both the with CmdStanPy and the StanJulia suite of
packages. Years ago, I found them convenient.

## Flat structure

Both CmdStanPy and StanJulia pursue a flat structure. This works
largely due to the provision of optional positional/keyword arguments
in a dynamic language.

This is not possible in Rust -- default values require the builder
pattern in order to be ergonomic.

## Naming

The difference between naming of arguments/options in CmdStan and
(CmdStanPy | StanJulia) can be a source of confusion. I suppose that
one would not have this problem if one never used CmdStan.

## Serialization/deserialization of inputs/outputs

Undoubtedly, both CmdStanPy and the Julia suite are targeted at the
dynamic language audience, which expects features such as
serialization/deserialization to be built in. In general, I would
expect that Rust programmers would probably want to select their own
I/O options, thus, I do not see it as a downside to exclude such
features.