-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rust interface for CmdStan #51
base: master
Are you sure you want to change the base?
Changes from all commits
c323f1a
5d6456f
1cbf527
2b5a086
3f6dc99
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,348 @@ | ||
- Feature Name: cmdstanrs | ||
- Start Date: 2023-11-22 | ||
- RFC PR: | ||
- Stan Issue: | ||
|
||
# Summary | ||
[summary]: #summary | ||
|
||
This is a proposal for a Rust interface for CmdStan through compiled | ||
executables, that is, no direct interface with C++. | ||
|
||
The goal is to provide an interface which enables users to: | ||
- compile Stan programs (with arbitrary options) | ||
- build and compose arguments/options (to be passed to C++ | ||
executables) in idiomatic Rust | ||
- call C++ executables, then memoize input and collect output (thereby | ||
making these available for programmatic use) | ||
- call `diagnose` and `stansummary` tools and collect output | ||
|
||
The objective is to keep the interface as simple as possible. | ||
|
||
# Motivation | ||
[motivation]: #motivation | ||
|
||
Suppose that you write Rust code. Suppose that you use `Stan` for | ||
probabilistic programming. You have three choices for creating an | ||
application which utilizes both: shell scripts, introduce a scripting | ||
language (e.g. Python) dependency, or a Rust interface for `CmdStan`. | ||
|
||
Orchestration using the shell suffers from portability issues and | ||
often leads unnecessary fragmentation of control flow. Introducing a | ||
scripting language may confer portability, but control flow and error | ||
handling are now divided between two languages; furthermore, this | ||
necessitates code be written to serialize/deserialize intermediates. | ||
|
||
A Rust interface, in similar spirit to the CmdStan interfaces from | ||
other languages, would provide the necessary abstraction to eliminate | ||
the aforementioned problems. | ||
|
||
# Functional Specification | ||
|
||
Given a Stan program, a user of the library will compile the model (if | ||
desired), call the executable with arguments (translated from | ||
strongly-typed argument tree), and obtain a self-contained context | ||
which encapsulates the pertinent information from the call. | ||
|
||
### Assumptions | ||
|
||
We assume (at our peril) that Rust programmers that will be able to | ||
figure out how to satisfy the following requirement: | ||
- a working CmdStan installation exists at some user-accessible path | ||
|
||
### Processes, IO | ||
|
||
The proposal is to use the Rust `std` library, in particular the | ||
[process](https://doc.rust-lang.org/std/process/index.html), | ||
[path](https://doc.rust-lang.org/std/path/index.html), | ||
[fs](https://doc.rust-lang.org/std/fs/index.html), and | ||
[ffi](https://doc.rust-lang.org/std/ffi/index.html) modules, to | ||
orchestrate processes, interact with file system and handle | ||
cross-platform concerns. This will yield a library which is portable, | ||
provided that it is (cross-)compiled for the intended target. | ||
|
||
## Control: compilation and calling the resultant executable | ||
|
||
A `CmdStanModel` type will serve as an abstraction for a Stan program, | ||
which may need to be compiled. Rather than compile on construction, a | ||
compilation method must be explicitly called in user code (assuming | ||
that a satisfactory executable does not exist yet). | ||
Comment on lines
+66
to
+69
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This model is something CmdStanPy is moving away from, because a CmdStanModel is pretty useless without an executable. In a strongly typed context, one could imagine that this type could instead be called something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Excellent idea. As is often the case, putting a name to a common idiom (the typestate pattern) is illuminating. |
||
|
||
Two arguments will be necessary to create a `CmdStanModel`: | ||
1. a path to a CmdStan installation | ||
2. a path to a Stan program | ||
|
||
Methods (receiver: `CmdStanModel` instance) exposed to the user will | ||
include: | ||
- `validate_cmdstan` : determine whether the CmdStan installation works | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Both this and passing the path to the CmdStan installation feel like they're concerns separate from a CmdStanModel instance to some extent. Does it make sense to have a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed. The path to the cmdstan installation going through #[derive(Clone)]
pub struct CmdStanInstallation { path: PathBuf }
pub struct StanProgram { path: PathBuf }
impl CmdStanInstallation {
fn validate(&self) -> io::Result<bool>; // you can break the install by deleting it
pub fn compile(&self, prog: &StanProgram) -> Result<CmdStanModel, CompilationError>;
pub fn try_from_env() -> io::Result<Self>; // we look for `CMDSTAN` in environment
}
impl TryFrom<&Path> for CmdStanInstallation { /* validate CmdStan works, else error */ } Using an environment variable is a reasonably consistent way to set a global without committing to a lazily-initialized global in Rust. Admittedly, this just moves the problem elsewhere, but it is clear what you would be getting -- the current CmdStan, as determined by a specific environment variable. I am not fond of user-facing globals (unless they truly are just static data, e.g. color schemes). If the global depends on something a user is to provide, it gets messy. If we provide an interface that requires a validated CmdStan installation, then it's up to the user to figure out how to meet said requirement. I don't see much harm placing the complexity on the user -- after all, it becomes our problem if it is ambiguous as to which CmdStan should be used at any given point. If a user really wanted to do it via global, the code below is an option. static CMDSTAN: once_cell::sync::Lazy<PathBuf> = once_cell::sync::Lazy::new(|| infallible_go_get_cmdstan());
fn my_cmdstan_install() -> CmdStanInstallation {
let path: &Path = &*CMDSTAN;
CmdStanInstallation::try_from(path).unwrap()
} |
||
- `executable_works` : is there a working executable at the path | ||
implied by the Stan file? | ||
- `compile_with_args` : attempt to compile a Stan program with | ||
optional `make` arguments | ||
- `call_executable` : call executable with the given argument tree; on | ||
success, return a `CmdStanOutput` instance. | ||
|
||
## Output | ||
|
||
Output of a successful `call_executable` call on a `CmdStanModel` will | ||
produce a `CmdStanOuput` instance, which encapsulates the context of | ||
the call. This includes: | ||
- the console output (exit status, stdout and stderr), | ||
- the argument tree provided to `call_executable` | ||
- the current working directory of the process at the time the call was made | ||
- the CmdStan installation from the parent `CmdStanModel` | ||
|
||
The objective is for `CmdStanOutput` to be a self-contained record | ||
which includes all pertinent information from a successful executable | ||
call. This structure can then be used to direct calls of | ||
`diagnose`/`stansummary`. Naturally, methods on said type will be | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you supply a complete list of signatures for methods on There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Public signatures are (at time of writing) /// Return the output files associated with the `CmdStan` call.
pub fn output_files(&self) -> Vec<PathBuf>;
/// Return the diagnostic files associated with the `CmdStan` call.
pub fn diagnostic_files(&self) -> Vec<PathBuf>;
/// Return the profile files associated with the `CmdStan` call.
pub fn profile_files(&self) -> Vec<PathBuf>;
/// Return an immutable reference to console output associated
/// with the `CmdStan` call.
pub fn output(&self) -> &process::Output;
/// Return a reference to the current working directory at the
/// time of the call.
pub fn cwd_at_call(&self) -> &Path;
/// Return a reference to the argument tree with which the call
/// was made.
pub fn argument_tree(&self) -> &ArgumentTree;
// EDIT: on second thought, no. The next 3 methods introduce an unnecessary
// coupling of state.
// pub fn cmdstan(&self) -> &Path
// pub fn diagnose(&self) -> io::Result<process::Output>;
// pub fn stansummary(&self, opts: Option<StanSummaryOptions>) -> io::Result<process::Output>
/// Write the console output to a file. The path at which the
/// file will be created is determined in the following manner: If
/// `file` is `None`, `"log.txt"` is adjoined on to `cwd_at_call`.
/// If `file` is `Some(path)` and `path` is a relative path, then
/// `path` is adjoined on to `cwd_at_call`; otherwise, `path` is
/// assumed to be an absolute path and is used without
/// modification. Upon successful write, the path at which the
/// file was created is returned.
pub fn write_output<P: AsRef<Path>>(&self, file: Option<P>) -> io::Result<PathBuf> I don't know if I would keep The path helper is actually pretty easy to figure out if you have the argument tree and cache the path of the working directory at the point of call. If the user had any relative paths, then said paths must have been relative to the directory at which they made the call (else the call fails and we don't make it this far); if they used absolute paths, then everything is sensible. edit: on reflection, I should not blithely underestimate the ability of users to create valid-but-spaghetti paths. |
||
present to expose the context to the user and perform utility | ||
functions (e.g. return a list of output file paths). | ||
|
||
## Arguments and options | ||
|
||
Stan provides several inference engines, each with a large number of | ||
options. CmdStan in turn handles this heterogeneity. | ||
|
||
To encapsulate the arguments passed at the command line (to a compiled | ||
executable), the proposal is a strongly-typed representation of this | ||
heterogeneity using a combination of sum types (Rust `enum`) and | ||
product types (Rust `struct`). By construction, this representation | ||
prevents the formation of inconsistent argument combinations -- the | ||
code simply won't compile. The resultant tree is an abstraction which | ||
enables the use of a single type (`CmdStanOutput`) to encapsulate a | ||
call to an executable. | ||
|
||
Unsurprisingly, the argument tree is a syntax tree for CmdStan | ||
command arguments. We translate to the very simple command line | ||
language, but leave open the possibility of translation to other | ||
languages. | ||
|
||
### Translation | ||
|
||
The (sloppy) productions for the command line language are: | ||
```text | ||
tree -> terms | ||
terms -> terms " " term | term | ||
term -> pair | product | sum | ||
pair -> key "=" value | ||
product -> type " " pairs | ||
sum -> type "=" variant " " terms | type "=" variant | ||
pairs -> pairs " " pair | pair | ||
|
||
key -> A | ||
A -> A alpha | beta | ||
alpha -> letter | digit | "_" | ||
beta -> letter | ||
letter -> "A" | ... | "z" | ||
digit -> "0" | ... | "9" | ||
|
||
value -> number | path | ||
``` | ||
Where the productions for `number` and `path` are left out for | ||
brevity. The start symbol is `tree`. Generate the command line | ||
statement by folding the tree from left to right by generating the | ||
appropriate term from each node, building up a linear argument list. | ||
I sketched | ||
[this](https://github.com/andrewjradcliffe/cmdstan-translator/blob/main/translate.scm) | ||
out in Scheme, why I am not sure. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure this section is necessary, providing an example of use and/or a list of expected structs may be more useful There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree -- this is at best a distraction. |
||
|
||
### Ergonomics | ||
|
||
Philosophy: | ||
- pay (in LOC) for only what you need. | ||
- minimize differences between naming of the types and fields (see | ||
below) in the Rust implementation and CmdStan. | ||
|
||
The builder pattern will be implemented for each `struct`, and for | ||
each `enum` variant (excluding unit variants). This enables the user | ||
to supply only the arguments for which they desire non-default | ||
values. This leads to succinct code when one needs only the defaults | ||
([example](https://github.com/andrewjradcliffe/cmdstan-rs/blob/main/examples/bernoulli-many/main.rs)). | ||
|
||
#### A side effect of strong typing | ||
|
||
With [company-mode](https://github.com/company-mode/company-mode) and | ||
[eglot-mode](https://github.com/joaotavora/eglot) | ||
([lsp-mode](https://github.com/emacs-lsp/lsp-mode/) also works) in | ||
Emacs 28.2, one can view options at each node in the argument tree by | ||
code that looks something like the following: | ||
|
||
```rust | ||
ArgumentTree::builder(). // hover on the `.` | ||
``` | ||
|
||
If one has a Rust language server and completion support in their | ||
editor, this is a free side effect. Whether it will help anyone | ||
is uncertain. | ||
|
||
### Coverage | ||
|
||
The objective is for the interface to cover all options which can be | ||
passed to a compiled Stan program, that is, all methods and all | ||
options for said methods. | ||
|
||
## Separation of concerns | ||
|
||
Other than the argument tree support, the interface proposed is very | ||
simple: the user can compile a Stan program, call it with arguments, | ||
get basic information from the output, and call | ||
`diagnose`/`stansummary`. | ||
|
||
Below, I provide the rationale for exclusion of two aspects. My | ||
judgment is that they are useful, but are best developed separately. | ||
|
||
### Serialization | ||
|
||
It is trivial to provide a function such as | ||
```rust | ||
fn write_json<T: Serialize>(data: &T, file: &Path) {} | ||
``` | ||
but it serves no purpose -- it does not enforce the conventions | ||
adopted for representing Stan data types in JSON (e.g. matrices | ||
represented as a vector of *row* vectors, not a vector of *column* | ||
vectors), hence, would likely lead to unexpected (and potentially | ||
silent!) errors. | ||
|
||
In order to develop a serializer which respects the conventions for | ||
Stan data types, one would need to declare conventions for the mapping | ||
of Rust data types to Stan data | ||
types. [serde_json](https://github.com/serde-rs/json) would be nice to | ||
use, but has some incompatibilities (Rust tuple is represented as an | ||
array, rather than an object). | ||
|
||
Moreover, to represent matrices, complex numbers, etc., one would need | ||
to support types from the crate ecosystem since the standard library | ||
lacks these -- [nalgebra](https://github.com/dimforge/nalgebra) and | ||
[num-complex](https://github.com/rust-num/num-complex) are reasonable | ||
choices, but nonetheless represent decisions to be made! | ||
|
||
From a design perspective, this is a great place to defer to the user, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For ease of implementation this is true, but the list of above subtleties (to me) make a loud argument in the opposite direction: this is something hard, and the obvious approach will easily lead to incorrect results, so it should be handled by the library (eventually?) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Alas, yes, it should be handled by the library. It could be gated by a feature, so as to delay the need for an implementation for the initial version. As additional dependencies would likely be introduced, a feature gate is probably appropriate anyway. |
||
at least for the moment. A principled approach would involve writing a | ||
data format for [serde](https://serde.rs/). | ||
|
||
### Deserialization | ||
|
||
Parsing Stan CSVs to a strongly-typed representation is simple if one | ||
wishes to simply obtain a matrix of values (or `Vec<Vec<f64>>` if we | ||
limit ourselves to `std` library types). However, one needs to extract | ||
the variables from each line, thus, one needs to know the types (and | ||
their dimensions). A recursive definition of types using `enum`s and | ||
`struct`s could probably work to represent such a thing in Rust, but | ||
may not necessarily be particularly ergonomic (i.e. much unavoidable | ||
boilerplate would be needed to use the resultant type). | ||
|
||
Procedural macros, applied to a Stan program stored in a string | ||
literal in a Rust program, could be used to generate types and a | ||
parser for Stan CSVs produced said program. However, in order to | ||
Comment on lines
+235
to
+237
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is a middle-ground between this (very cool, but tricky) idea and doing nothing, which is providing some primitives (which may themselves be macros) that allow the user to provide the information required about types and names of variables. Worth considering, and it seems like any future automation could build on it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The gate here is: decide on conventions for representation of Stan types as Rust types. For arrays of dimension greater than 2, there is a problem.
If we take the latter option (the only sensible course for stable development), then it would be relatively straightforward to use a procedural macro to derive a parser for a user-provided struct where the field names and respective types must match the Stan CSV output. I do something similar for my use cases. |
||
implement such an idea, one would first need to adopt conventions for | ||
representing Stan data types using Rust data types. This requires | ||
careful thought and and is something best left for the future, if | ||
ever. | ||
|
||
The current proposal is for `CmdStanOutput` to be capable of returning | ||
paths to the files and the user parses them however they desire. | ||
This leaves open the possibility of multiple parsing strategies. | ||
|
||
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
Representing CmdStan arguments/options as a concrete syntax tree is | ||
potentially brittle. If the CmdStan grammar undergoes radical change, | ||
this interface will need to change accordingly. However, the CmdStan | ||
grammar is intended to be quite stable. Moreover, it is not | ||
necessarily the case that radical changes to the CmdStan grammar could | ||
be hidden behind something other abstraction. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since breaking changes are discussed here, it seems like a reasonable place to ask about versioning: I assume the goal is to emulate cmdstanpy/r and be "mostly" cmdstan-version-independent? Or would a given version of this crate be specialized for a specific cmdstan version? Version independence is nice in theory but IMO has some practical concerns which may lead one to reconsider the decisions make when working on cmdstanpy/r, especially when it comes to static checking. If cmdstan adds a new argument, supporting older versions simultaneously now requires the builder to do some runtime checks on top of what the Rust type system provided you already, which is a bummer. Alternatively, the argument structs could themselves be versioned... but that seems like a pain. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Technically, quite a few (maybe all?) of the structs and enums would need to be marked with If the Rust code stays up-to-date with the latest cmdstan arguments and options, then the worst case is that an older cmdstan will return something to the effect of "... is not a valid value for ..." in the stderr. An output is already not a guarantee (the process could die for any reason), thus, rather than leaving it to the user, we could provide a course classification of the errors (insufficient cmdstan version to support the user's argument tree being a variant). I do not see an easy way to raise cmdstan capability checks to compile-time without resorting to strange things in a build script or procedural macro antics. Even then, the cmdstan installation that a user intends to use must be available at compile time. Alternatively, one could package cmdstan as a Rust crate and manage it as a dependency with cargo. I do not recommend this due to the other headaches it would create. edit: add link There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That seems reasonable! Saying as much in the document would be appreciated |
||
|
||
# Rationale and alternatives | ||
[rationale-and-alternatives]: #rationale-and-alternatives | ||
|
||
Other than the direct representation of the CmdStan syntax tree, the | ||
proposal contains nothing new. Utilizing a concrete representation of | ||
the syntax tree does have benefits: | ||
- all outputs handled via single type `CmdStanOutput` | ||
- elimination of individual structures and methods for each of Stan's | ||
inference algorithms | ||
Comment on lines
+264
to
+265
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think eliminating these is probably a good idea, but maybe not one whose time has come. At the moment, the output files of e.g. ADVI should not be treated the same as those produced by sampling, because the first row is the mean of the approximation, not a draw. Therefore, something like picking a random row for initialization or even passing to One hopes that cmdstan will eventually have some nicer output conventions, such as putting the mean in its own file, but in the mean time the type safety of preventing you from doing something foolish could be nice. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My thought was that with the argument tree cached in the output, special handling could be implemented there (e.g. On the other hand, it would be possible to move the common functionality to a trait and have one output type for each method; this would enable generic code for Another enum, mirroring the method variants in the argument tree, would likely be easier to reason about (rather than a boxed trait object). But if you have to |
||
|
||
The question remains: is this is a good idea? | ||
|
||
## Argument tree considerations | ||
|
||
As described above, the grammar for CmdStan arguments passed at the | ||
command line can be represented as a syntax tree through the use of | ||
sum and product types. | ||
- This enables compile-time validation of argument consistency -- the | ||
worst that can happen is you provide a value that CmdStan does not | ||
like (e.g. `num_threads=-20`). | ||
- At minimum, this will move a variety of run-time errors to compile | ||
time; it might even help users to understand the methods and options | ||
CmdStan provides. | ||
- This enables re-use of a parameterized argument tree -- one could | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. [comment] This is a very nice plus, IMO. |
||
replace the inference method while leaving the other options | ||
(e.g. data files) constant. As shown in [this | ||
example](https://github.com/andrewjradcliffe/cmdstan-rs/blob/main/examples/bernoulli-many/main.rs), | ||
such an approach can be quite expressive. | ||
|
||
Furthermore, representation as a concrete syntax tree enables the | ||
possibility of interesting features. One could parse the syntax tree | ||
from: | ||
- a string written to a log file | ||
- a string which is consistent with the grammar that CmdStan accepts | ||
Comment on lines
+289
to
+290
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or, one hopes, the JSON file cmdstan itself can now produce There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would also like to support reading from JSON; that should be relatively easy. I got interested in PEGs, went a little overboard, and now there's a parser which is consistent with CmdStan; it supports a variety of input formats. |
||
|
||
The latter is interesting in that a user's extant command line input | ||
is all that is required to use the Rust interface. For example, | ||
this leads to the following syntax: | ||
|
||
```rust | ||
// Assuming we implemented this through the `FromStr` trait | ||
let tree: ArgumentTree = "method=sample data file=bernoulli.data.json".parse().unwrap(); | ||
``` | ||
|
||
This would substantially lower the barrier to adoption of the Rust | ||
interface as the user need only know what they are already doing. | ||
|
||
Due to Rust's orphan rules, such features would need to be implemented | ||
within this crate; they could be placed behind a feature gate to | ||
minimize compile time. It stands to reason that if we can translate to | ||
a string, we should be able to perform the inverse operation. | ||
|
||
The design philosophy here would be: a valid parse is whatever CmdStan | ||
is willing to accept. However, CmdStan accepts some weird statements. | ||
For example: | ||
```bash | ||
./bernoulli method=sample adapt engaged engaged=0 engaged engaged=1 gamma engaged gamma \ | ||
data file=bernoulli.data.json | ||
``` | ||
|
||
The proposal is to use [pest](https://github.com/pest-parser/pest), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One thing I am curious about is total dependencies (it may be worth calling this out as its own list/table somewhere). Both Python and R's wrappers at least strive to be minimal, within what their standard libraries provide. I think the wonder of Could items like this be hidden behind a cargo feature? Continuing the only-pay-for-what-you-want mentality from earlier There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Things like MSRV may also be worth mentioning! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If not for this function, the MSRV would be quite a bit older. Assuming that we put the parser behind a feature gate, the only remaining dependency is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't have any objections to You bring up an interesting other point: I (sort of implicitly) expected that this wrapper could be written using entirely safe Rust. Is that What other usages of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The single unsafe function arose from the need to split the string on the occurrence of CmdStan basically just splits at the first occurrence of ./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file='..'
# produces as output file names: `._2.`, `._3.`, `._4.`
./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file=.foo
# produces as output file names: `_2.foo`, `_3.foo`, `_4.foo`
# this is actually a repeat of the above, with "foo" as the empty string
./bernoulli id=2 method=sample num_chains=3 data file=bernoulli.data.json output file='.'
# produces as output file names: `_2.`, `_3.`, `_4.` It is not possible to easily treat these by two edge cases due to the definition of an extension. I see no issue with It is worth noting that Each use looks like: let bytes = path.as_encoded_bytes();
let mut iter = bytes.rsplitn(2, |b| *b == b'.');
let (prefix, suffix) = match (iter.next(), iter.next()) {
(Some(suffix), Some(prefix)) => {
// SAFETY:
// - each fragment only contains content that originated
// from `OsStr::as_encoded_bytes`.
// - split with ASCII period, which is a non-empty UTF-8
// substring
unsafe {
(
OsStr::from_encoded_bytes_unchecked(prefix),
OsStr::from_encoded_bytes_unchecked(suffix),
)
}
}
_ => (path, "csv".as_ref()),
}; I general, my philosophy on use of I realize that my response above could actually be read as somewhat alarming -- potentially implies lots of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This also sounds like something we will probably change (stan-dev/cmdstan#1213) - the current system is mostly designed-by-accident There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
use std::ffi::{OsString, OsStr};
let (prefix, suffix) = match (path.file_stem(), path.extension()) {
(Some(prefix), Some(suffix)) => (prefix, suffix.to_os_string()),
(Some(name), None) if name.as_encoded_bytes().first() == Some(&b'.') => {
let mut fake_name = std::path::PathBuf::from("x");
fake_name.as_mut_os_string().push(name);
(OsStr::new(""), fake_name.extension().unwrap().to_os_string())
}
(Some(name), None) =>
(name, OsString::from("csv")),
(None, _) => {
// No file name, which means the last path component is either "." or ".."
if path.as_os_str().as_encoded_bytes().ends_with(b"..") {
(OsStr::new("."), OsString::from(""))
} else {
(OsStr::new(""), OsString::from(""))
}
}
};
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In order:
function bernoulli() {
./bernoulli data file=bernoulli.data.json sample num_chains=3 output file=$1
}
mkdir -p foo/bar
# files created by CmdStan
bernoulli 'foo/.bar' # "foo/_1.bar", "foo/_2.bar", "foo/_3.bar"
bernoulli 'foo/bar/' # "foo/bar/_1.csv", "foo/bar/_2.csv", "foo/bar/_3.csv"
bernoulli 'foo/bar/.' # "foo/bar/_1.", "foo/bar/_2.", "foo/bar/_3."
bernoulli 'foo/bar/..' # "foo/bar/._2.", "foo/bar/._3.", "foo/bar/._4."
bernoulli 'foo/bar/...' # "foo/bar/.._2.", "foo/bar/.._3.", "foo/bar/.._4." One might think that From a code complexity standpoint, splitting at the This is a good example of where
SummaryUntil CmdStan switches from splitting on the last dot, there is no reason to deviate from said behavior in the Rust treatment of paths. Path normalization is not trivial (see cppreference for a nice recapitulation of the steps). Effort spent on attempting to reproduce the CmdStan behavior via a convoluted use of Examples like Case 2 demonstrate that such a machine would not be feasible, as it is not sufficient to split parent and child, check the respective start/end, and be done. One must scan the entire string for dots and this is only possible after Furthermore, the discrepancy between CmdStan's behavior and the proposed interface would be a a potential source of confusion. If someone wants to use wacky file names and the program (CmdStan) accepts them, we would be remiss to impose artificial limitations. |
||
rather than write a custom parser. | ||
|
||
# Prior art | ||
[prior-art]: #prior-art | ||
|
||
I have used both the with CmdStanPy and the StanJulia suite of | ||
packages. Years ago, I found them convenient. | ||
|
||
## Flat structure | ||
|
||
Both CmdStanPy and StanJulia pursue a flat structure. This works | ||
largely due to the provision of optional positional/keyword arguments | ||
in a dynamic language. | ||
|
||
This is not possible in Rust -- default values require the builder | ||
pattern in order to be ergonomic. | ||
|
||
## Naming | ||
|
||
The difference between naming of arguments/options in CmdStan and | ||
(CmdStanPy | StanJulia) can be a source of confusion. I suppose that | ||
one would not have this problem if one never used CmdStan. | ||
|
||
## Serialization/deserialization of inputs/outputs | ||
|
||
Undoubtedly, both CmdStanPy and the Julia suite are targeted at the | ||
dynamic language audience, which expects features such as | ||
serialization/deserialization to be built in. In general, I would | ||
expect that Rust programmers would probably want to select their own | ||
I/O options, thus, I do not see it as a downside to exclude such | ||
features. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few more exotic options as well:
stan/services
APIsbridgestan
, as is done in e.g.nutpie
These may be worth mentioning in the Rationale and Alternatives, as both have their own downsides as well.