Replace this package with a VirtualiZarr reader? #337
Turns out there is already an issue discussing something very similar (which didn't appear when I searched "kerchunk") - see #28 (comment).
I've been thinking about this, and I'm not 100% sure that it's a good idea in the end. The main issue is that most MITgcm output is not compressed at all, so direct upload to the cloud may not be something we want to encourage, especially for realistic-geometry simulations that contain a lot of land (compression usually does not reduce the size of the ocean points very much). The upside of the format is that flexible chunking should be possible in theory.

The LLC2160 & LLC4320 data are in a bespoke "shrunk" (still binary) format in which the land points have been removed, so further compression would have very limited benefit. But reading it would require writing code that is very specific to this dataset, and I do not believe that further datasets will be generated in this bespoke format.

Some of the data-access problem with this data has nothing to do with the format and is simply caused by the limited bandwidth out of Pleiades. Still, given the choice between a general MITgcm reader and a more specific reader for LLC2160/4320, I think the more specific reader would be most useful: this data is still by far the heaviest lift most people are doing, and many people cannot use it because of how difficult access still is. (This is all just my opinion and I am prepared to hear other arguments.)
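To make the "shrunk" format concrete, here is a minimal sketch of what reading a land-compressed level involves. All names, the dtype, and the exact on-disk ordering are assumptions for illustration, not the real NAS layout; the key point is that reconstruction needs a wet-point mask with the full grid shape as an external input:

```python
import numpy as np

def read_shrunk_level(data_path, wet_mask, dtype=">f4", fill_value=np.nan):
    """Expand one land-compressed 2D level back onto the full grid.

    wet_mask : boolean array with the full grid shape, True at ocean points.
    The file is assumed to hold exactly wet_mask.sum() values, in the order
    the wet points appear when the grid is traversed in C order.
    """
    compact = np.fromfile(data_path, dtype=dtype, count=int(wet_mask.sum()))
    full = np.full(wet_mask.shape, fill_value, dtype="f4")
    full[wet_mask] = compact  # scatter ocean values back to their grid cells
    return full
```

The mask itself has to come from a separate dataset (e.g. something like a wet/dry indicator derived from hFacC > 0), which is the external dependency that complicates the codec idea discussed below.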
I actually started something like this three years ago: https://github.com/rabernat/mds2zarr - of course VirtualiZarr is a much better and more robust approach. I agree with @cspencerjones that the funky compression of the LLC data is potentially a blocker. If we can make it Zarr-compatible, it should be possible. However, that is really an edge case: most "normal" MDS data output from MITgcm should be perfectly fine as uncompressed flat binary.
This seems like an analogous problem to zarr-developers/zarr-specs#303 - i.e. it could be solved by defining a special zarr codec that is specific to this data format.
Except it's really complicated because the "codec" for decoding each array relies on an external dataset (the null mask) which doesn't even have the same shape as the data. This breaks many of the abstractions implicit in the "codec" interface.
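For illustration, a hedged sketch of what such a codec could look like with numcodecs (the class name, codec_id, and constructor arguments are all hypothetical). It also shows exactly where the abstraction strains: the wet-point indices are array-valued "configuration" rather than something the codec can read from the store:

```python
import numpy as np
from numcodecs.abc import Codec
from numcodecs import register_codec

class ShrunkLLC(Codec):
    """Decode land-compressed chunks by scattering onto a full grid."""

    codec_id = "shrunk-llc"  # hypothetical identifier

    def __init__(self, wet_indices, chunk_shape, dtype=">f4"):
        # wet_indices: flat indices of the ocean points within one chunk.
        # This is the external dataset the codec interface has no natural
        # slot for: it is array-valued and chunk-dependent, so it cannot
        # round-trip through the JSON codec metadata in .zarray.
        self.wet_indices = np.asarray(wet_indices)
        self.chunk_shape = tuple(chunk_shape)
        self.dtype = dtype

    def decode(self, buf, out=None):
        compact = np.frombuffer(buf, dtype=self.dtype)
        full = np.full(int(np.prod(self.chunk_shape)), np.nan, dtype="f4")
        full[self.wet_indices] = compact
        return full.reshape(self.chunk_shape)

    def encode(self, buf):
        # One-way in practice: we only ever read this format.
        raise NotImplementedError

register_codec(ShrunkLLC)
```

Even this sketch glosses over the shape mismatch: the indices would have to be precomputed per chunk from the mask dataset and injected from outside, which is precisely what the codec interface doesn't anticipate.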
I don't really know anything about the format of MITgcm output files other than that they are some bespoke binary format, but I can't help wondering whether it would actually be easier to create a cloud-optimized version of MITgcm data by writing a reader for VirtualiZarr (i.e. a kerchunk reader) rather than actually converting the binary data to zarr.
The advantages would be that the original binary files would not need to be duplicated or rewritten, only indexed, and the result could still be opened lazily through the zarr/xarray interface.
It would involve essentially rewriting this function (xmitgcm/xmitgcm/utils.py, line 87 at 63ba751) to look like either one of the kerchunk readers, or ideally more like zarr-developers/VirtualiZarr#113.
Because MITgcm output already seems to separate metadata from data to some degree, this could potentially work really nicely...
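As a rough sketch of that rewrite: instead of loading bytes, the reader would emit kerchunk-style references mapping zarr chunk keys to byte ranges inside the untouched .data file. Everything here is a placeholder (the variable name, shape, dtype, and dimension names would really be parsed from the accompanying .meta file):

```python
import json
import numpy as np
import fsspec
import xarray as xr

def mds_to_references(data_path, varname, shape, dtype=">f4"):
    """Build a kerchunk-style reference set for one uncompressed MDS record."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    zarray = {
        "shape": list(shape),
        "chunks": list(shape),   # one chunk per on-disk record
        "dtype": dtype,
        "compressor": None,      # plain MDS output is uncompressed flat binary
        "filters": None,
        "fill_value": None,
        "order": "C",
        "zarr_format": 2,
    }
    return {
        "version": 1,
        "refs": {
            ".zgroup": json.dumps({"zarr_format": 2}),
            f"{varname}/.zarray": json.dumps(zarray),
            f"{varname}/.zattrs": json.dumps({"_ARRAY_DIMENSIONS": ["j", "i"]}),
            # chunk key -> [url, offset, length]: no bytes are copied
            f"{varname}/0.0": [data_path, 0, nbytes],
        },
    }

refs = mds_to_references("T.0000000000.data", "T", shape=(90, 90))
fs = fsspec.filesystem("reference", fo=refs)
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)
```

Note that the chunk grid here is pinned to the on-disk record layout, which is exactly the chunking limitation mentioned below.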
See also zarr-developers/VirtualiZarr#218
One downside of that approach would be the inability to alter the chunking, though.
cc @cspencerjones