Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Best format of input files? #85

Closed
gassmoeller opened this issue Nov 14, 2024 · 3 comments
Closed

Best format of input files? #85

gassmoeller opened this issue Nov 14, 2024 · 3 comments

Comments

@gassmoeller
Copy link
Contributor

This is a slightly difficult issue to write, because I need to strike a balance between the purpose of this project (which is good and I support) and one of the fundamental decisions you made when creating the project (which I like to initiate a discussion about). I would like to discuss and get your opinion about the format of the input files that you chose.

Just as a basis for discussion, this is the format you chose:

"""Set the full path to the flow executable and flags"""
mpirun -np 33 flow --tolerance-mb=1e-7 --linear-solver=cprw --newton-min-iterations=1 --enable-tuning=true --enable-opm-rst-file=true --output-extra-convergence-info=steps,iterations

"""Set the model parameters"""
spe11a master     #Name of the spe case (spe11a, spe11b, or spe11c) and OPM Flow version (master or release)
complete gaswater #Name of the co2 model (immiscible, convective [convective requires a Flow version newer than 22-08-2024], or complete) and co2store implementation (gaswater or gasoil [oil properties are set to water internally in OPM flow])
cartesian         #Type of grid (cartesian, tensor, or corner-point)
2.8 0.01 1.2      #Length, width, and depth [m]
280               #If cartesian, number of x cells [-]; otherwise, variable array of x-refinment
1                 #If cartesian, number of y cells [-]; otherwise, variable array of y-refinment [-] (for spe11c)
120               #If cartesian, number of z cells [-]; if tensor, variable array of z-refinment; if corner-point, fix array of z-refinment (18 entries)
20 20             #Temperature bottom and top rig [C]
1.2 110000 1      #Datum [m], pressure at the datum [Pa], and multiplier for the permeability in the z direction [-] 
1e-9 1.6e-5       #Diffusion (in liquid and gas) [m^2/s]
0 0               #Rock specific heat and density (for spe11b/c)
0 0 0             #Added pore volume on top boundary (for spe11a [if 0, free flow bc]), pore volume on lateral boundaries, and width of buffer cell [m] (for spe11b/c)
0 0               #Elevation of the parabola and back boundary [m] (for spe11c)

"""Set the saturation functions"""
(max(0, (s_w - swi) / (1 - swi))) ** 2                                                        #Wetting rel perm saturation function [-]
(max(0, (1 - s_w - sni) / (1 - sni))) ** 2                                                    #Non-wetting rel perm saturation function [-]
penmax * math.erf(pen * ((s_w-swi) / (1.-swi)) ** (-(1.0 / 2)) * math.pi**0.5 / (penmax * 2)) #Capillary pressure saturation function [Pa]
(np.exp(np.flip(np.linspace(0, 5.0, npoints))) - 1) / (np.exp(5.0) - 1)                       #Points to evaluate the saturation functions (s_w) [-]

Just from a data perspective, this is a:

  • line-based plain-text format
  • supporting comments (#)
  • with custom markup elements ("""), I am actually unclear if they serve a purpose or are comments
  • only containing values, not key-value pairs (e.g. x=y)
  • since there are no keys it seems to depend on the order of arguments
  • without a hierarchy (there are no sections, unless the """ stands for that)

Such a type of input format has a number of serious challenges:

  • it is not versioned (i.e. you cannot tell when and for what version of pyopenspe you made that file)
  • it is not backward compatible (old files will not work with new versions of pyopenspe if the number or order of parameters change)
  • it is not expressive (without the comments I have no idea what each line means)
  • it is not readable by standard input parser libraries (e.g. configparser, or argparse), instead you wrote your own parser (in utils/inputvalues.py)
  • writing your own parser is actually pretty hard to get correct, e.g. I think the following is true:
    • your parser does not fail gracefully if I do not follow the format (e.g. put in the wrong number of lines, put in too many arguments on one line, put in the wrong type, put values outside of ranges)
    • your parser does not support default arguments, specifying the expected type of a value, specifying the expected data range of a value, a documentation string that belongs to a key-value pair, other types of information about the value (e.g. if it is deprecated and will be removed)

I am not pointing this out, because I want to diminish your effort. Constructing data formats and writing parsers is just hard (e.g. one of the projects I contribute to that uses a custom parser has a parser of 2500 lines of C++ code), which is why many state-of-the-art software makes use of existing file formats and established parser libraries that can read them.

Currently the following formats are widely in-use:

E.g. the input file listed above, but converted to TOML could look like:

# Set the full path to the flow executable and flags
flow_exectuable = "mpirun -np 33 flow --tolerance-mb=1e-7 --linear-solver=cprw --newton-min-iterations=1 --enable-tuning=true --enable-opm-rst-file=true --output-extra-convergence-info=steps,iterations"

[Model parameters]
spe_case = "spe11a master"     #Name of the spe case (spe11a, spe11b, or spe11c) and OPM Flow version = "version" (master or release)
co2_model = ["complete", "gaswater"] #Name of the co2 model (immiscible, convective [convective requires a Flow version newer than 22-08-2024], or complete) and co2store implementation (gaswater or gasoil [oil properties are set to water internally in OPM flow])
grid = "cartesian"         #Type of grid (cartesian, tensor, or corner-point)
grid_length = [2.8, 0.01, 1.2]      #Length, width, and depth [m]
...

[Saturation functions]
wetting_permeability_function = "(max(0, (s_w - swi) / (1 - swi))) ** 2"
...

And this file could be read into a python dictionary as simple as:

import tomllib

with open("spe11b.toml", "rb") as f:
    data = tomllib.load(f)

You can then check that all necessary entries in the dict exist with the correct type, or create default values if parameters are not given in the file.

I understand that this is one of the basic assumptions about the user interface of pyopenspe11 and you already distributed this to a number of users, but I would like to at least urge you to consider to switch a better data format. You will make your own life and that of your users unbelievably hard with the current data format, and better formats are available and easy to use / implement. Additionally using standard data formats is one of the guiding principles of FAIR research software (https://www.go-fair.org/fair-principles/, I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.](https://www.go-fair.org/fair-principles/i1-metadata-use-formal-accessible-shared-broadly-applicable-language-knowledge-representation/).

Related to openjournals/joss-reviews#7357.

@daavid00
Copy link
Collaborator

daavid00 commented Dec 3, 2024

Thanks @gassmoeller for this extensive issue.

We totally agree with you, that widely-use formats for parsing input as the ones you have mentioned should be used instead of writing own parsers. The main reason we adopted this approach was practicality, i.e., no need to define the Python variable names (i.e., less characters in the file), and only change values using the provided configuration files in the examples folder.

Then, to address this issue and comply with the guiding principles of FAIR research software you mentioned, the coming PR will add support for TOML files, i.e., we will keep the existing parsing for compatibility with previous configuration files, while TOML format will be used for input files with toml extension, e.g.,spe11b.toml.

@daavid00
Copy link
Collaborator

daavid00 commented Dec 4, 2024

Support for toml input files has been added in #102, thanks again @gassmoeller.

@gassmoeller
Copy link
Contributor Author

Thanks for the quick implementation, I am glad you picked up the idea. 🎉

we will keep the existing parsing for compatibility with previous configuration files, while TOML format will be used for input files with toml extension, e.g.,spe11b.toml

I understand the wish to stay compatible with existing input file. I would recommend to then clearly state in the documentation that the text based format is frozen/deprecated with some message, and after a reasonable time ultimately remove the text format. Keeping two input formats of which one is increasingly outdated will split your user base and eventually confuse new users. This is something for the future of course.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants