Best format of input files? #85

gassmoeller · 2024-11-14T16:45:23Z

This is a slightly difficult issue to write, because I need to strike a balance between the purpose of this project (which is good and I support) and one of the fundamental decisions you made when creating the project (which I like to initiate a discussion about). I would like to discuss and get your opinion about the format of the input files that you chose.

Just as a basis for discussion, this is the format you chose:

"""Set the full path to the flow executable and flags"""
mpirun -np 33 flow --tolerance-mb=1e-7 --linear-solver=cprw --newton-min-iterations=1 --enable-tuning=true --enable-opm-rst-file=true --output-extra-convergence-info=steps,iterations

"""Set the model parameters"""
spe11a master     #Name of the spe case (spe11a, spe11b, or spe11c) and OPM Flow version (master or release)
complete gaswater #Name of the co2 model (immiscible, convective [convective requires a Flow version newer than 22-08-2024], or complete) and co2store implementation (gaswater or gasoil [oil properties are set to water internally in OPM flow])
cartesian         #Type of grid (cartesian, tensor, or corner-point)
2.8 0.01 1.2      #Length, width, and depth [m]
280               #If cartesian, number of x cells [-]; otherwise, variable array of x-refinment
1                 #If cartesian, number of y cells [-]; otherwise, variable array of y-refinment [-] (for spe11c)
120               #If cartesian, number of z cells [-]; if tensor, variable array of z-refinment; if corner-point, fix array of z-refinment (18 entries)
20 20             #Temperature bottom and top rig [C]
1.2 110000 1      #Datum [m], pressure at the datum [Pa], and multiplier for the permeability in the z direction [-] 
1e-9 1.6e-5       #Diffusion (in liquid and gas) [m^2/s]
0 0               #Rock specific heat and density (for spe11b/c)
0 0 0             #Added pore volume on top boundary (for spe11a [if 0, free flow bc]), pore volume on lateral boundaries, and width of buffer cell [m] (for spe11b/c)
0 0               #Elevation of the parabola and back boundary [m] (for spe11c)

"""Set the saturation functions"""
(max(0, (s_w - swi) / (1 - swi))) ** 2                                                        #Wetting rel perm saturation function [-]
(max(0, (1 - s_w - sni) / (1 - sni))) ** 2                                                    #Non-wetting rel perm saturation function [-]
penmax * math.erf(pen * ((s_w-swi) / (1.-swi)) ** (-(1.0 / 2)) * math.pi**0.5 / (penmax * 2)) #Capillary pressure saturation function [Pa]
(np.exp(np.flip(np.linspace(0, 5.0, npoints))) - 1) / (np.exp(5.0) - 1)                       #Points to evaluate the saturation functions (s_w) [-]

Just from a data perspective, this is a:

line-based plain-text format
supporting comments (#)
with custom markup elements ("""), I am actually unclear if they serve a purpose or are comments
only containing values, not key-value pairs (e.g. x=y)
since there are no keys it seems to depend on the order of arguments
without a hierarchy (there are no sections, unless the """ stands for that)

Such a type of input format has a number of serious challenges:

it is not versioned (i.e. you cannot tell when and for what version of pyopenspe you made that file)
it is not backward compatible (old files will not work with new versions of pyopenspe if the number or order of parameters change)
it is not expressive (without the comments I have no idea what each line means)
it is not readable by standard input parser libraries (e.g. configparser, or argparse), instead you wrote your own parser (in utils/inputvalues.py)
writing your own parser is actually pretty hard to get correct, e.g. I think the following is true:
- your parser does not fail gracefully if I do not follow the format (e.g. put in the wrong number of lines, put in too many arguments on one line, put in the wrong type, put values outside of ranges)
- your parser does not support default arguments, specifying the expected type of a value, specifying the expected data range of a value, a documentation string that belongs to a key-value pair, other types of information about the value (e.g. if it is deprecated and will be removed)

I am not pointing this out, because I want to diminish your effort. Constructing data formats and writing parsers is just hard (e.g. one of the projects I contribute to that uses a custom parser has a parser of 2500 lines of C++ code), which is why many state-of-the-art software makes use of existing file formats and established parser libraries that can read them.

Currently the following formats are widely in-use:

TOML (a simple and yet pretty complete format, https://en.wikipedia.org/wiki/TOML)
INI (an older/worse version of TOML, widely used in Windows OS)
YAML (very versatile, still easy to read and write, https://en.wikipedia.org/wiki/YAML)
JSON (has best support for schemas -- a way to specify what input parameters are allowed, https://en.wikipedia.org/wiki/JSON)

E.g. the input file listed above, but converted to TOML could look like:

# Set the full path to the flow executable and flags
flow_exectuable = "mpirun -np 33 flow --tolerance-mb=1e-7 --linear-solver=cprw --newton-min-iterations=1 --enable-tuning=true --enable-opm-rst-file=true --output-extra-convergence-info=steps,iterations"

[Model parameters]
spe_case = "spe11a master"     #Name of the spe case (spe11a, spe11b, or spe11c) and OPM Flow version = "version" (master or release)
co2_model = ["complete", "gaswater"] #Name of the co2 model (immiscible, convective [convective requires a Flow version newer than 22-08-2024], or complete) and co2store implementation (gaswater or gasoil [oil properties are set to water internally in OPM flow])
grid = "cartesian"         #Type of grid (cartesian, tensor, or corner-point)
grid_length = [2.8, 0.01, 1.2]      #Length, width, and depth [m]
...

[Saturation functions]
wetting_permeability_function = "(max(0, (s_w - swi) / (1 - swi))) ** 2"
...

And this file could be read into a python dictionary as simple as:

import tomllib

with open("spe11b.toml", "rb") as f:
    data = tomllib.load(f)

You can then check that all necessary entries in the dict exist with the correct type, or create default values if parameters are not given in the file.

I understand that this is one of the basic assumptions about the user interface of pyopenspe11 and you already distributed this to a number of users, but I would like to at least urge you to consider to switch a better data format. You will make your own life and that of your users unbelievably hard with the current data format, and better formats are available and easy to use / implement. Additionally using standard data formats is one of the guiding principles of FAIR research software (https://www.go-fair.org/fair-principles/, I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.](https://www.go-fair.org/fair-principles/i1-metadata-use-formal-accessible-shared-broadly-applicable-language-knowledge-representation/).

Related to openjournals/joss-reviews#7357.

The text was updated successfully, but these errors were encountered:

daavid00 · 2024-12-03T15:44:06Z

Thanks @gassmoeller for this extensive issue.

We totally agree with you, that widely-use formats for parsing input as the ones you have mentioned should be used instead of writing own parsers. The main reason we adopted this approach was practicality, i.e., no need to define the Python variable names (i.e., less characters in the file), and only change values using the provided configuration files in the examples folder.

Then, to address this issue and comply with the guiding principles of FAIR research software you mentioned, the coming PR will add support for TOML files, i.e., we will keep the existing parsing for compatibility with previous configuration files, while TOML format will be used for input files with toml extension, e.g.,spe11b.toml.

daavid00 · 2024-12-04T19:41:52Z

Support for toml input files has been added in #102, thanks again @gassmoeller.

gassmoeller · 2024-12-20T09:29:16Z

Thanks for the quick implementation, I am glad you picked up the idea. 🎉

we will keep the existing parsing for compatibility with previous configuration files, while TOML format will be used for input files with toml extension, e.g.,spe11b.toml

I understand the wish to stay compatible with existing input file. I would recommend to then clearly state in the documentation that the text based format is frozen/deprecated with some message, and after a reasonable time ultimately remove the text format. Keeping two input formats of which one is increasingly outdated will split your user base and eventually confuse new users. This is something for the future of course.

gassmoeller mentioned this issue Nov 15, 2024

[REVIEW]: pyopmspe11: A Python framework using OPM Flow for the SPE11 benchmark project openjournals/joss-reviews#7357

Open

daavid00 mentioned this issue Dec 4, 2024

Support for toml input files #102

Merged

gassmoeller closed this as completed Dec 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Best format of input files? #85

Best format of input files? #85

gassmoeller commented Nov 14, 2024

daavid00 commented Dec 3, 2024

daavid00 commented Dec 4, 2024

gassmoeller commented Dec 20, 2024

Best format of input files? #85

Best format of input files? #85

Comments

gassmoeller commented Nov 14, 2024

daavid00 commented Dec 3, 2024

daavid00 commented Dec 4, 2024

gassmoeller commented Dec 20, 2024