Zarr Metadata #59

Merged
merged 12 commits into from
Jul 25, 2024
@thwllms thwllms commented Jul 23, 2024

Changes

  • Adds new methods to RasPlanHdf to generate Kerchunk-type Zarr metadata from HEC-RAS HDF files. These methods return dictionaries.
    • zmeta_mesh_cells_timeseries_output
    • zmeta_mesh_faces_timeseries_output
    • zmeta_reference_lines_timeseries_output
    • zmeta_reference_points_timeseries_output
  • Renames several methods, keeping the old names as deprecated aliases:
    • mesh_cells_timeseries_output (old: mesh_timeseries_output_cells)
    • mesh_faces_timeseries_output (old: mesh_timeseries_output_faces)
  • Adds methods to return DataFrame objects of summary output data:
    • mesh_cells_summary_output
    • mesh_faces_summary_output
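
One common way to keep an old method name working after a rename is a thin alias that warns and then forwards to the new name. The sketch below is illustrative only and is not taken from the rashdf source: `PlanLike` and `deprecated_alias` are hypothetical names, and only the two method names come from the list above.

```python
import warnings


def deprecated_alias(new_method_name):
    """Build a method that emits a DeprecationWarning, then forwards the call."""

    def alias(self, *args, **kwargs):
        warnings.warn(
            f"This method is deprecated; use '{new_method_name}' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(self, new_method_name)(*args, **kwargs)

    return alias


class PlanLike:
    """Hypothetical stand-in for RasPlanHdf, for illustration only."""

    def mesh_cells_timeseries_output(self, mesh_name):
        # Placeholder return value; the real method returns an xarray.Dataset.
        return f"cells timeseries for {mesh_name}"

    # Old name kept so that existing callers keep working.
    mesh_timeseries_output_cells = deprecated_alias("mesh_cells_timeseries_output")
```

With this pattern, calling the old name returns the same result as the new one, and callers see a warning that points at their own call site (`stacklevel=2`).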

Zarr Metadata

The goal of generating Zarr metadata from RAS HDF files is to treat multiple RAS HDF files as if they were a single file. By combining the dictionaries returned by the zmeta_* methods listed above, we can open many RAS HDF files with the same structure (i.e., from the same model, with the same settings, per FFRD) as a single data source. The metadata generated by each method records where to find specific data within a single HDF file; when metadata from multiple files is combined, we can operate on many HDF files as if they were one. This is explained in the new Advanced documentation page.
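
To make the idea of combining per-file metadata concrete, here is a simplified, stdlib-only sketch. It is not how rashdf or Kerchunk is implemented: `combine_refs` is a hypothetical helper, the byte offsets and sizes are made up, and the input is assumed to contain only chunk keys (no `.zarray` or `.zattrs` entries, which a real combiner would also need to rewrite). Each per-file dictionary maps a chunk key to a location in one HDF file, and combining adds a new leading index that identifies the file.

```python
def combine_refs(per_file_refs):
    """Merge chunk references from several files along a new leading dimension.

    per_file_refs is a list of {chunk_key: [url, offset, size]} dicts, one per
    HDF file. A chunk key such as "Water Surface/0.0" from the i-th file
    becomes "Water Surface/i.0.0" in the combined mapping.
    """
    combined = {}
    for index, refs in enumerate(per_file_refs):
        for key, location in refs.items():
            variable, chunk_index = key.rsplit("/", 1)
            combined[f"{variable}/{index}.{chunk_index}"] = location
    return combined


# Hypothetical references for the first chunk of one variable in two simulations.
sim1 = {"Water Surface/0.0": ["s3://bucket/simulations/1/BigRiver.p01.hdf", 4096, 170256]}
sim2 = {"Water Surface/0.0": ["s3://bucket/simulations/2/BigRiver.p01.hdf", 4096, 170256]}

combined = combine_refs([sim1, sim2])
for key, location in sorted(combined.items()):
    print(key, "->", location)
```

The combined mapping still points into the original HDF files; no data is copied, only the lookup table grows.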

From the new Advanced docs page

The cell timeseries output for a single simulation might look something like this:

>>> from rashdf import RasPlanHdf
>>> plan_hdf = RasPlanHdf.open_uri("s3://bucket/simulations/1/BigRiver.p01.hdf")
>>> plan_hdf.mesh_cells_timeseries_output("BigRiverMesh1")
<xarray.Dataset> Size: 66MB
Dimensions:                              (time: 577, cell_id: 14188)
Coordinates:
  * time                                 (time) datetime64[ns] 5kB 1996-01-14...
  * cell_id                              (cell_id) int64 114kB 0 1 ... 14187
Data variables:
    Water Surface                        (time, cell_id) float32 33MB dask.array<chunksize=(3, 14188), meta=np.ndarray>
    Cell Cumulative Precipitation Depth  (time, cell_id) float32 33MB dask.array<chunksize=(3, 14188), meta=np.ndarray>
Attributes:
    mesh_name:  BigRiverMesh1

Note that the example below requires installation of the optional libraries kerchunk, zarr, fsspec, and s3fs:

from rashdf import RasPlanHdf
from kerchunk.combine import MultiZarrToZarr
import json

# Example S3 URL pattern for HEC-RAS plan HDF5 files
s3_url_pattern = "s3://bucket/simulations/{sim}/BigRiver.p01.hdf"

zmeta_files = []
sims = list(range(1, 11))

# Generate Zarr metadata for each simulation
for sim in sims:
    s3_url = s3_url_pattern.format(sim=sim)
    plan_hdf = RasPlanHdf.open_uri(s3_url)
    zmeta = plan_hdf.zmeta_mesh_cells_timeseries_output("BigRiverMesh1")
    json_file = f"BigRiver.{sim}.p01.hdf.json"
    with open(json_file, "w") as f:
        json.dump(zmeta, f)
    zmeta_files.append(json_file)

# Combine Zarr metadata files into a single Kerchunk metadata file
# with a new "sim" dimension
mzz = MultiZarrToZarr(zmeta_files, concat_dims=["sim"], coo_map={"sim": sims})
mzz_dict = mzz.translate()

with open("BigRiver.combined.p01.json", "w") as f:
    json.dump(mzz_dict, f)

Now, we can open the combined dataset with xarray:

import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "BigRiver.combined.p01.json"},
    },
    chunks="auto",
)

The resulting combined dataset includes a new sim dimension:

<xarray.Dataset> Size: 674MB
Dimensions:                              (sim: 10, time: 577, cell_id: 14606)
Coordinates:
  * cell_id                              (cell_id) int64 117kB 0 1 ... 14605
  * sim                                  (sim) int64 80B 1 2 3 4 5 6 7 8 9 10
  * time                                 (time) datetime64[ns] 5kB 1996-01-14...
Data variables:
    Cell Cumulative Precipitation Depth  (sim, time, cell_id) float32 337MB dask.array<chunksize=(10, 228, 14606), meta=np.ndarray>
    Water Surface                        (sim, time, cell_id) float32 337MB dask.array<chunksize=(10, 228, 14606), meta=np.ndarray>
Attributes:
    mesh_name:  BigRiverMesh1

codecov bot commented Jul 23, 2024

Codecov Report

Attention: Patch coverage is 97.87234% with 2 lines in your changes missing coverage. Please review.

Files Coverage Δ
src/rashdf/base.py 90.47% <100.00%> (+18.25%) ⬆️
src/rashdf/utils.py 91.34% <100.00%> (+0.72%) ⬆️
src/rashdf/plan.py 97.18% <97.50%> (-0.49%) ⬇️

chunk_meta[chunk_key] = [str(self._loc), value["offset"], value["size"]]
zarr_tmp = zarr.MemoryStore()
ds.to_zarr(zarr_tmp, mode="w", compute=False, encoding=encoding)
zarr_meta = {"version": 1, "refs": {}}

What does the version of 1 here refer to? I'm guessing that's just a Zarr metadata version that will only be updated when changes are made specifically to the Zarr metadata functions?

Yeah, it's the Zarr metadata version. More details here: https://fsspec.github.io/kerchunk/spec.html
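
For readers unfamiliar with the layout described at that link, here is a small sketch of a version 1 reference set and how its entries can be read. The URL, offset, and size are made up for illustration, and `describe_ref` is a hypothetical helper, not part of rashdf or Kerchunk. In this layout a reference value is either a string holding the content inline, or a list of the form `[url]` or `[url, offset, length]`.

```python
import json

# Hypothetical reference set; the URL, offset, and size are made up.
zarr_meta = {
    "version": 1,
    "refs": {
        # Small metadata documents are stored inline as JSON strings.
        ".zgroup": json.dumps({"zarr_format": 2}),
        # Chunk data is referenced by location inside the original HDF file.
        "Water Surface/0.0": ["s3://bucket/simulations/1/BigRiver.p01.hdf", 4096, 170256],
    },
}


def describe_ref(refs, key):
    """Say how a reader would obtain the bytes stored under key."""
    value = refs[key]
    if isinstance(value, str):
        return f"{key}: inline content ({len(value)} characters)"
    if len(value) == 1:
        return f"{key}: the whole of {value[0]}"
    url, offset, size = value
    return f"{key}: {size} bytes at offset {offset} of {url}"


for key in zarr_meta["refs"]:
    print(describe_ref(zarr_meta["refs"], key))
```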

@sray014 sray014 left a comment

Didn't see any issues. The advanced docs are a nice touch.

@zherbz zherbz left a comment

Tried hard but could not spot anything to call out. Looks great!

@thwllms thwllms merged commit 6b45130 into main Jul 25, 2024
5 checks passed
@thwllms thwllms deleted the feature/zmeta branch July 25, 2024 15:11