Merge branch 'main' into py_version_update
Showing 31 changed files with 820 additions and 199 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# Read the Docs configuration file for Sphinx projects | ||
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details | ||
|
||
version: 2 | ||
|
||
build: | ||
os: ubuntu-22.04 | ||
tools: | ||
python: "3.9" | ||
|
||
python: | ||
install: | ||
- requirements: docs/requirements.txt | ||
- method: pip | ||
path: . | ||
|
||
sphinx: | ||
configuration: docs/source/conf.py |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,7 @@ | ||
# esm_catalog_utils | ||
|
||
tools/utilities to support the usage of catalogs to access and analyze ESM output | ||
|
||
## Documentation | ||
|
||
https://esm-catalog-utils.readthedocs.io/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,5 @@ | ||
sphinx<6.0 | ||
sphinx_rtd_theme | ||
urllib3<2.0 | ||
sphinx | ||
furo | ||
intake-esm | ||
pydantic<2.0 | ||
myst-nb |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
.. currentmodule:: esm_catalog_utils | ||
|
||
############# | ||
API reference | ||
############# | ||
|
||
This page provides an auto-generated summary of esm_catalog_utils' API. | ||
|
||
Top-level functions | ||
=================== | ||
|
||
.. autosummary:: | ||
:toctree: generated/ | ||
|
||
caseroot_to_esm_datastore | ||
directory_to_esm_datastore | ||
caseroot_to_case_metadata | ||
case_metadata_to_esm_datastore | ||
parse_file_cesm | ||
parse_path_cesm |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
================= | ||
Developer's Guide | ||
================= | ||
|
||
Coding Style | ||
------------ | ||
|
||
Code Formatting | ||
~~~~~~~~~~~~~~~ | ||
|
||
The code of the package is formatted using the tools `black | ||
<https://black.readthedocs.io/>`_ and `isort <https://pycqa.github.io/isort/>`_. | ||
This ensures that the code across the package has a consistent appearance. | ||
|
||
Documentation Strings | ||
~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Documentation strings (docstrings) follow the Docstring Standard from the | ||
`numpy Style guide <https://numpydoc.readthedocs.io/en/latest/format.html>`_. | ||
This standard describes how the content of docstrings is organized. | ||
Docstrings are written using `reStructuredText | |
<http://docutils.sourceforge.net/rst.html>`_ markup syntax and are rendered | ||
into documentation using `Sphinx <https://www.sphinx-doc.org/>`_. | ||
|
||
Function Annotations/Type Hints | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
`Function annotations <https://peps.python.org/pep-3107/>`_ are used to | |
document the types of a function's parameters and return values. | |
This enables users of the package to use external tools like `mypy | ||
<https://mypy.readthedocs.io/en/stable/>`_ to help ensure that they're | ||
using the package properly. | ||
Python's `typing module <https://peps.python.org/pep-0484/>`_ is used to | ||
support the annotations. | ||
|
||
Testing | ||
------- | ||
|
||
Testing is performed with continuous integration using `GitHub Actions | |
<https://github.com/features/actions>`_. | |
Tests are run against Python versions 3.7 through 3.11. | |
Testing consists of the following: | ||
|
||
- Run the source code through `black <https://black.readthedocs.io/>`_ and | ||
`isort <https://pycqa.github.io/isort/>`_ to verify that the desired code | ||
formatting is adhered to. | ||
- Run the source code through `flake8 <https://flake8.pycqa.org/>`_, which | ||
analyzes the code and detects various errors. | ||
- Run the source code through `mypy | ||
<https://mypy.readthedocs.io/en/stable/>`_, to ensure that variable types | ||
are used appropriately throughout the package. | ||
- Run unit tests, located in the ``tests`` subdirectory. The unit tests include | |
creating catalogs from internally generated input files and verifying that | ||
the generated catalogs match baseline catalogs that are included in the | ||
repository. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
================================== | ||
ESM Catalog Background Information | ||
================================== | ||
|
||
A simplified view of ESM catalogs is that they consist of the paths of | ||
ESM output files, metadata about these files, e.g., names of data variables | ||
in the files and date ranges covered, and metadata about how the data files | ||
can be aggregated together. | ||
|
||
More generally, data files can reside in the cloud, in which case `URIs | ||
<https://en.wikipedia.org/wiki/Uniform_Resource_Identifier>`_ are used | ||
instead of paths, and data files can be in a format where their content is | ||
spread across multiple files, e.g., :std:doc:`zarr <zarr:index>`. | ||
In the following, ESM output is referred to as assets, to recognize these | ||
generalizations. | ||
|
||
Metadata about the assets referred to by an ESM catalog (paths or URIs, | ||
data variable names, date ranges, etc.) is stored in memory in a | ||
:py:mod:`pandas` :py:class:`~pandas.DataFrame` object, and on disk in a | ||
comma-separated values (CSV) file. | ||
|
||
The primary data structure in :std:doc:`intake-esm <intake-esm:index>` | ||
to support ESM catalogs is the :std:doc:`esm_datastore | ||
<intake-esm:reference/api>` class. | ||
Loosely speaking, this class consists of an | ||
:std:doc:`intake-esm:reference/esm-catalog-spec` and functions that operate | ||
on class objects. | ||
The :std:doc:`intake-esm:reference/esm-catalog-spec` consists of a | ||
dictionary of asset metadata that is available, i.e., columns in the | ||
above-mentioned CSV file, metadata about how the assets can be aggregated | ||
together, and some other metadata, such as a description of the catalog. | ||
The metadata regarding aggregation is stored in an `aggregation control object | ||
<https://intake-esm.readthedocs.io/en/stable/reference/esm-catalog-spec.html#aggregation-control-object>`_. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,133 @@ | ||
============= | ||
General Usage | ||
============= | ||
|
||
Creating a Catalog | ||
------------------ | ||
|
||
Catalogs, i.e., :std:doc:`esm_datastore <intake-esm:reference/api>` objects, | |
are created in :mod:`esm_catalog_utils` from a casename and a list of | ||
directories containing model output. | ||
The casename and list of directories are stored in a dictionary with | ||
keys ``case`` and ``output_dirs`` respectively. | ||
We refer to this dictionary as ``case_metadata``. | ||
The function :func:`~esm_catalog_utils.case_metadata_to_esm_datastore` | ||
takes a *case_metadata* argument and returns an :std:doc:`esm_datastore | |
<intake-esm:reference/api>` object for the output files in the ``output_dirs`` | |
directories and their subdirectories. | |
Additional arguments are described in its :func:`API documentation | ||
<esm_catalog_utils.case_metadata_to_esm_datastore>`. | ||
|
||
:mod:`esm_catalog_utils` also provides helper functions that generate | |
the ``case_metadata`` dictionary in particular use cases, call | |
:func:`~esm_catalog_utils.case_metadata_to_esm_datastore`, and | |
return the result. | |
|
||
:func:`~esm_catalog_utils.directory_to_esm_datastore` is a helper function | ||
for the use case of having model output in a single top-level directory | ||
and its subdirectories. | ||
The *dir* argument of :func:`~esm_catalog_utils.directory_to_esm_datastore` | ||
is the top-level directory where the model output is located. | ||
The casename can be either passed as the *case* argument to | ||
:func:`~esm_catalog_utils.directory_to_esm_datastore` | ||
or inferred from the basename of *dir*. | ||
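As a minimal sketch of the dictionary described above, the following builds a ``case_metadata``-style dictionary for a single top-level directory and infers the casename from the directory's basename when none is given. The helper name ``build_case_metadata`` and the example path are hypothetical. They only illustrate the ``case`` and ``output_dirs`` keys and are not part of :mod:`esm_catalog_utils`.

```python
import os.path
from typing import Optional


def build_case_metadata(top_dir: str, case: Optional[str] = None) -> dict:
    """Hypothetical helper: assemble a case_metadata-style dictionary."""
    if case is None:
        # Infer the casename from the basename of the directory.
        # normpath drops a trailing separator so the basename is not empty.
        case = os.path.basename(os.path.normpath(top_dir))
    return {"case": case, "output_dirs": [top_dir]}


# The path below is made up for illustration.
case_metadata = build_case_metadata("/scratch/archive/my_case/")
print(case_metadata)
```

In practice, the documented helper functions take care of this step, so a dictionary like this would only be built by hand when calling ``case_metadata_to_esm_datastore`` directly.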
|
||
:func:`~esm_catalog_utils.caseroot_to_esm_datastore` is a helper function | ||
that takes a *caseroot* argument. | ||
It determines the ``case_metadata``, i.e., the casename and the location of | |
the model output, from the XML files in *caseroot*. | |
|
||
Additional arguments to these helper functions are passed through to | ||
:func:`~esm_catalog_utils.case_metadata_to_esm_datastore`. | ||
Example usage of these helper functions is provided in the | |
:ref:`notebooks`. | ||
|
||
Parallelization | ||
~~~~~~~~~~~~~~~ | ||
|
||
Extracting the metadata from model output files, such as the data variable | |
names and date ranges, involves opening the files and examining their | |
metadata. | |
For long runs, there can be tens of thousands of native model history files. | |
Opening all of these files and examining their metadata can take a | ||
considerable amount of time. | ||
In order to speed up this process, | ||
:func:`~esm_catalog_utils.case_metadata_to_esm_datastore` can use | ||
:std:doc:`dask:index` to accelerate this embarrassingly parallel task. | ||
If the *use_dask* argument to | ||
:func:`~esm_catalog_utils.case_metadata_to_esm_datastore` is ``True``, then | ||
it will wrap the file open and query operations inside | ||
:std:doc:`dask:index` :py:class:`~dask.delayed.Delayed` objects and execute | ||
them in parallel. | ||
|
||
This should only be done if | ||
:func:`~esm_catalog_utils.case_metadata_to_esm_datastore` is called after | ||
instantiating a :std:doc:`dask.distributed:index` | ||
:py:class:`~distributed.Client`, as otherwise an error may be raised. | ||
The default value for *use_dask* is ``False``. | ||
|
||
The *use_dask* argument can also be passed to the helper functions | ||
:func:`~esm_catalog_utils.directory_to_esm_datastore` and | ||
:func:`~esm_catalog_utils.caseroot_to_esm_datastore`, and it will be passed | ||
through to :func:`~esm_catalog_utils.case_metadata_to_esm_datastore`. | ||
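The per-file queries are independent of one another, which is what makes the task embarrassingly parallel. The sketch below illustrates that pattern only. It deliberately uses the standard library's ``concurrent.futures`` as a stand-in rather than dask, and ``query_file_metadata`` is a hypothetical placeholder for opening a file and examining its metadata. It does not show how :mod:`esm_catalog_utils` is implemented.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def query_file_metadata(path: str) -> dict:
    """Hypothetical stand-in for opening a file and examining its metadata."""
    return {"path": path, "size": os.path.getsize(path)}


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few small files to stand in for model history files.
    paths = []
    for i in range(4):
        path = os.path.join(tmp_dir, f"file_{i}.nc")
        with open(path, "wb") as f:
            f.write(b"x" * (i + 1))
        paths.append(path)

    # Each file is queried independently of the others, so the queries
    # can be dispatched to a pool of workers; map returns results in
    # the same order as the input paths.
    with ThreadPoolExecutor(max_workers=2) as pool:
        entries = list(pool.map(query_file_metadata, paths))

sizes = [entry["size"] for entry in entries]
print(sizes)
```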
|
||
Writing and Reading a Catalog | ||
----------------------------- | ||
|
||
:std:doc:`esm_datastore <intake-esm:reference/api>` objects can be written | ||
to disk using the object's :func:`serialize` method, which is documented in | ||
the intake-esm :std:doc:`intake-esm:reference/api`. | ||
The resulting files can be read using :func:`intake.open_esm_datastore`. | ||
Example usage of these methods and functions is provided in the | ||
:ref:`notebooks`. | ||
|
||
Updating a Catalog | ||
------------------ | ||
|
||
Even with the parallel speed-up provided by *use_dask*, generating a | ||
catalog for a long run takes a non-trivial amount of time. | ||
A use case for analysis of ESM output that regularly occurs, particularly | ||
during a development cycle, is to analyze a run, extend the run, and | ||
analyze the extended run. | ||
:func:`~esm_catalog_utils.case_metadata_to_esm_datastore` has an argument | ||
named *esm_datastore_in* to accelerate this use case. | ||
If this argument is passed, | ||
:func:`~esm_catalog_utils.case_metadata_to_esm_datastore` will return an | ||
:py:class:`esm_datastore` object with entries appended to | ||
*esm_datastore_in*. | ||
The paths determined from the *case_metadata* argument to | ||
:func:`~esm_catalog_utils.case_metadata_to_esm_datastore` are checked for | ||
existence in *esm_datastore_in*'s DataFrame ``df``. | ||
If the path is present in ``df`` and the file's size differs from its size | ||
in *esm_datastore_in*, then the entry for that path is recreated. | ||
If the file's size is the same as its size in *esm_datastore_in*, | ||
then that file's catalog entry is propagated without reopening the file | ||
and querying its metadata. | ||
Because checking a file's size is much faster than this metadata query, | ||
this option provides a considerable speed-up in this use case. | ||
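The size comparison described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the package's implementation. The function ``update_entries``, the dictionary layout, and ``fake_query`` are all hypothetical, and the sketch assumes that paths absent from the previous catalog are handled the same way as paths whose size changed.

```python
from typing import Callable, Dict, List, Tuple


def update_entries(
    paths_and_sizes: List[Tuple[str, int]],
    previous: Dict[str, dict],
    query_metadata: Callable[[str], dict],
) -> Dict[str, dict]:
    """Hypothetical sketch of size-based reuse of catalog entries."""
    updated = {}
    for path, size in paths_and_sizes:
        old = previous.get(path)
        if old is not None and old["size"] == size:
            # Same size as before: propagate the entry without reopening the file.
            updated[path] = old
        else:
            # New path or changed size: recreate the entry with the expensive query.
            updated[path] = {"size": size, **query_metadata(path)}
    return updated


queried = []  # records which paths needed the expensive query


def fake_query(path: str) -> dict:
    """Hypothetical metadata query that records each call."""
    queried.append(path)
    return {"varname": ["TEMP"]}


previous = {
    "a.nc": {"size": 10, "varname": ["TEMP"]},
    "b.nc": {"size": 20, "varname": ["TEMP"]},
}
current = [("a.nc", 10), ("b.nc", 25), ("c.nc", 5)]
updated = update_entries(current, previous, fake_query)
print(queried)
```

Only the changed file and the new file trigger the query in this sketch, which is where the speed-up comes from when most files are unchanged.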
|
||
The *esm_datastore_in* argument can also be passed to the helper functions | ||
:func:`~esm_catalog_utils.directory_to_esm_datastore` and | ||
:func:`~esm_catalog_utils.caseroot_to_esm_datastore`, and it will be passed | ||
through to :func:`~esm_catalog_utils.case_metadata_to_esm_datastore`. | ||
|
||
Example usage of the *esm_datastore_in* argument is provided in the | |
:ref:`notebooks`. | ||
|
||
Catalog Issues Specific to History Files | ||
---------------------------------------- | ||
In some model analysis use cases, the model output being analyzed has been | ||
post-processed into files that have a single data variable per file. | ||
In contrast, native model history file output, the files written directly | ||
by ESMs, typically has multiple data variables per file. | ||
For native history files, the ``varname`` column of the CSV file component | |
of the ESM catalog holds a list of variable names rather than a single name. | |
Additional steps are necessary to properly parse such files when calling | ||
:func:`intake.open_esm_datastore`. | ||
As described in the :std:doc:`intake-esm documentation | ||
<intake-esm:how-to/use-catalogs-with-assets-containing-multiple-variables>`, | ||
one approach to handle this use case is to pass the value | ||
``{"converters": {"varname": ast.literal_eval}}`` to the *read_csv_kwargs* | ||
argument of :func:`intake.open_esm_datastore` when | ||
reading the catalog. | ||
This is demonstrated in the :doc:`history file example notebook | ||
<notebooks/ex1_caseroot_hist>`. |
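To show why a converter is needed, the sketch below reads a tiny in-memory CSV whose ``varname`` column contains a list written out as text. It uses the standard library's ``csv`` module rather than pandas or intake-esm, and the file name and variable names are made up for illustration. The point is only that the list arrives as a string and ``ast.literal_eval`` turns it back into a Python list.

```python
import ast
import csv
import io

# A tiny in-memory CSV; the list in the varname column is stored as text.
csv_text = """path,varname
hist.0001-01.nc,"['TEMP', 'SALT']"
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
raw = rows[0]["varname"]        # still a string after reading
parsed = ast.literal_eval(raw)  # converted back to a Python list
print(type(raw).__name__, type(parsed).__name__, parsed)
```

Passing ``ast.literal_eval`` as a converter for the ``varname`` column, as described above, applies the same conversion to every row when the catalog is read.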