Replace this package with a VirtualiZarr reader? #337
Turns out there is already an issue discussing something very similar (which didn't appear when I searched "kerchunk") - see #28 (comment).
I've been thinking about this, and I'm not 100% sure that it's a good idea in the end. The main issue is that most MITgcm output is not compressed at all, so direct upload to the cloud may not be something we want to encourage, especially for realistic-geometry simulations that contain a lot of land (compression usually does not reduce the size of the ocean points very much). The upside of the format is that flexible chunking should be possible in theory.

The LLC2160 & LLC4320 data are in a bespoke "shrunk" (still binary) format in which the land points have been removed, so further compression would have very limited benefit. But reading it would require writing code that is very specific to this dataset, and I do not believe that further datasets will be generated in this bespoke format.

Some of the data-access problem with this data has nothing to do with the format and is simply caused by the limited bandwidth out of Pleiades. Still, given the choice between a general MITgcm reader and a more specific reader for LLC2160/4320, I think the more specific reader would be most useful: this data is still by far the heaviest lift most people are doing, and many people cannot use it because of how difficult access still is. (This is all just my opinion and I am prepared to hear other arguments.)
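To make the "shrunk" format concrete, here is a minimal sketch of what reading a land-compressed level involves. All names, the dtype, and the exact on-disk ordering are assumptions for illustration, not the real NAS layout; the key point is that reconstruction needs a wet-point mask with the full grid shape as an external input:

```python
import numpy as np

def read_shrunk_level(data_path, wet_mask, dtype=">f4", fill_value=np.nan):
    """Expand one land-compressed 2D level back onto the full grid.

    wet_mask : boolean array with the full grid shape, True at ocean points.
    The file is assumed to hold exactly wet_mask.sum() values, in the order
    the wet points appear when the grid is traversed in C order.
    """
    compact = np.fromfile(data_path, dtype=dtype, count=int(wet_mask.sum()))
    full = np.full(wet_mask.shape, fill_value, dtype="f4")
    full[wet_mask] = compact  # scatter ocean values back to their grid cells
    return full
```

The mask itself has to come from a separate dataset (e.g. something like a wet/dry indicator derived from hFacC > 0), which is the external dependency that complicates the codec idea discussed below.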
I actually started something like this three years ago: https://github.com/rabernat/mds2zarr - of course VirtualiZarr is a much better and more robust approach. I agree with @cspencerjones that the funky compression of the LLC data is potentially a blocker. If we can make it Zarr-compatible, it should be possible. However, that is really an edge case: most "normal" MDS data output from MITgcm should be perfectly fine as uncompressed flat binary.
This seems like an analogous problem to zarr-developers/zarr-specs#303 - i.e. it could be solved by defining a special zarr codec that is specific to this data format.
Except it's really complicated because the "codec" for decoding each array relies on an external dataset (the null mask) which doesn't even have the same shape as the data. This breaks many of the abstractions implicit in the "codec" interface.
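For illustration, a hedged sketch of what such a codec could look like with numcodecs (the class name, codec_id, and constructor arguments are all hypothetical). It also shows exactly where the abstraction strains: the wet-point indices are array-valued "configuration" rather than something the codec can read from the store:

```python
import numpy as np
from numcodecs.abc import Codec
from numcodecs import register_codec

class ShrunkLLC(Codec):
    """Decode land-compressed chunks by scattering onto a full grid."""

    codec_id = "shrunk-llc"  # hypothetical identifier

    def __init__(self, wet_indices, chunk_shape, dtype=">f4"):
        # wet_indices: flat indices of the ocean points within one chunk.
        # This is the external dataset the codec interface has no natural
        # slot for: it is array-valued and chunk-dependent, so it cannot
        # round-trip through the JSON codec metadata in .zarray.
        self.wet_indices = np.asarray(wet_indices)
        self.chunk_shape = tuple(chunk_shape)
        self.dtype = dtype

    def decode(self, buf, out=None):
        compact = np.frombuffer(buf, dtype=self.dtype)
        full = np.full(int(np.prod(self.chunk_shape)), np.nan, dtype="f4")
        full[self.wet_indices] = compact
        return full.reshape(self.chunk_shape)

    def encode(self, buf):
        # One-way in practice: we only ever read this format.
        raise NotImplementedError

register_codec(ShrunkLLC)
```

Even this sketch glosses over the shape mismatch: the indices would have to be precomputed per chunk from the mask dataset and injected from outside, which is precisely what the codec interface doesn't anticipate.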
I don't really know anything about the format of MITgcm output files other than that they are some bespoke binary format, but I can't help wondering whether it would actually be easier to create a cloud-optimized version of MITgcm data by writing a reader for VirtualiZarr (i.e. a kerchunk reader) rather than actually converting the binary data to zarr.
The advantages would be that the original binary files would not need to be duplicated or rewritten, only indexed, and the result could still be opened lazily through the zarr/xarray interface.
It would involve essentially rewriting this function (xmitgcm/xmitgcm/utils.py, line 87 at 63ba751) to look like either one of the kerchunk readers, or ideally more like zarr-developers/VirtualiZarr#113.
Because MITgcm output already seems to separate metadata from data to some degree, this could potentially work really nicely...
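As a rough sketch of that rewrite: instead of loading bytes, the reader would emit kerchunk-style references mapping zarr chunk keys to byte ranges inside the untouched .data file. Everything here is a placeholder (the variable name, shape, dtype, and dimension names would really be parsed from the accompanying .meta file):

```python
import json
import numpy as np
import fsspec
import xarray as xr

def mds_to_references(data_path, varname, shape, dtype=">f4"):
    """Build a kerchunk-style reference set for one uncompressed MDS record."""
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    zarray = {
        "shape": list(shape),
        "chunks": list(shape),   # one chunk per on-disk record
        "dtype": dtype,
        "compressor": None,      # plain MDS output is uncompressed flat binary
        "filters": None,
        "fill_value": None,
        "order": "C",
        "zarr_format": 2,
    }
    return {
        "version": 1,
        "refs": {
            ".zgroup": json.dumps({"zarr_format": 2}),
            f"{varname}/.zarray": json.dumps(zarray),
            f"{varname}/.zattrs": json.dumps({"_ARRAY_DIMENSIONS": ["j", "i"]}),
            # chunk key -> [url, offset, length]: no bytes are copied
            f"{varname}/0.0": [data_path, 0, nbytes],
        },
    }

refs = mds_to_references("T.0000000000.data", "T", shape=(90, 90))
fs = fsspec.filesystem("reference", fo=refs)
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)
```

Note that the chunk grid here is pinned to the on-disk record layout, which is exactly the chunking limitation mentioned below.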
See also zarr-developers/VirtualiZarr#218
One downside of that approach would be the inability to alter the chunking, though.
cc @cspencerjones