Extension proposal: chunk statistics #319
Comments
This is a cool and useful idea, and people have proposed it before. The challenge for the exact thing you suggest here will be managing Alternatively, if we view "chunk statistics" in the abstract as a tabular data structure, with 1 row per chunk, then there are lots of ways to store that data structure, some of which might not generate so many files. If you have the time, maybe it would be worth hacking together a prototype? If that prototype proves useful to you, that will be sure proof that the idea might be useful to someone else, and we can worry about scaling details later.
I think this idea dovetails with concepts used in tools like kerchunk and virtualizarr, where an explicit list of chunks (a "chunk manifest") is created and used for data retrieval, instead of relying on the discovered state of the storage backend. That same chunk manifest could perhaps be augmented with columns for chunk statistics?
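To make the "manifest with extra columns" idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the row layout and the field names (`key`, `min`, `max`) are hypothetical and do not reflect how kerchunk, VirtualiZarr, or Icechunk actually store their manifests. It shows one row per chunk, and a reader using the min/max columns to decide which chunks could possibly contain a given value.

```python
# Hypothetical manifest: one row per chunk, with statistics stored as
# extra columns. Field names are assumptions made for this sketch.
manifest = [
    {"key": "0.0", "min": 0.0, "max": 3.5},
    {"key": "0.1", "min": 4.0, "max": 9.0},
    {"key": "1.0", "min": None, "max": None},  # chunk with no valid values
]


def candidate_chunks(manifest, value):
    """Return keys of chunks whose [min, max] range could contain `value`."""
    return [
        row["key"]
        for row in manifest
        if row["min"] is not None and row["min"] <= value <= row["max"]
    ]


# Only the chunks returned here would need to be fetched and decoded.
to_read = candidate_chunks(manifest, 5.0)
```

Chunks whose range excludes the value are never fetched, which is where the saving would come from; a real implementation would also need to decide how to treat chunks that have no statistics at all.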
Agreed, manifests are an obvious place to put this kind of thing: just one more value per chunk. This will increase the size of the manifests, but they ought to end up small regardless. The idea sort of suggests that manifests ought to be the only way to store extra information per chunk, so that the chunks themselves remain encoded data only. (Note that checksum codec implementations currently do store extra data in the chunk alongside the data itself.) How to make use of the metadata is another issue. We are talking about adding a "query plan" layer to what zarr currently does. The related suggestion was for codecs where the parameters vary between chunks, e.g., a different scale/offset for each. That also requires updating the read pipeline, but in a different way.
I agree - which is why I raised #305 to discuss the general problem of storing metadata that scales with EDIT: Since that issue was raised Icechunk was released, which currently uses
Thanks @barbuz for opening this issue. It's a topic we've already been thinking about with Icechunk, and we've intentionally left room in the Chunk Manifest spec for such a feature. As @TomNicholas points out, we're currently using MessagePack for the manifests, but this is extensible. We could pick a different container for very large manifests in the future. I also agree with @martindurant - we'll need to think carefully about how Zarr implementations would take advantage of such statistics.
Thank you all for your feedback on this. I appreciate the points you made, which I can summarise in two main items:
How to manage O(num_chunks) metadata
The approach I envisioned is based on a separate file for each chunk because this would make it easy to do parallel writes and reads across several chunks, in line with the main philosophy of Zarr. I know that this is not the most compact way of storing this information, but maybe it won't be a big issue in practice? Some tests may be needed to be sure.
How to use the statistics in queries
This, I believe, is the hard part. My naive idea would be for the Zarr libraries to expose access to these statistics, and for the data analysis libraries built on top of Zarr to use them to make their queries more efficient. I will admit though that I have been mostly managing Zarr data through
This is a draft idea for a possible Zarr extension, looking for feedback before putting more work into it.
The idea in short
For each chunk, save a separate json file containing statistics about values in that chunk (including e.g. minimum, maximum, average, number of non-nan values).
Readers can read this very small file instead of the chunk itself to perform some kinds of computation more efficiently.
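A minimal sketch of what this could look like, using only the Python standard library. Everything here is an assumption made for illustration: the function names, the JSON field names (`min`, `max`, `mean`, `count`), and the choice to represent a chunk as a flat list of floats are hypothetical and not part of any Zarr API or spec.

```python
import json
import math


def chunk_stats(values):
    """Summarise one decoded chunk, given as a flat iterable of floats."""
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        # An all-NaN chunk has no meaningful min, max, or mean.
        return {"min": None, "max": None, "mean": None, "count": 0}
    return {
        "min": min(valid),
        "max": max(valid),
        "mean": sum(valid) / len(valid),
        "count": len(valid),
    }


def combine_stats(per_chunk):
    """Derive whole-array statistics from per-chunk ones, without chunk data."""
    non_empty = [s for s in per_chunk if s["count"] > 0]
    if not non_empty:
        return {"min": None, "max": None, "mean": None, "count": 0}
    total = sum(s["count"] for s in non_empty)
    return {
        "min": min(s["min"] for s in non_empty),
        "max": max(s["max"] for s in non_empty),
        # Weight each chunk mean by its number of valid values.
        "mean": sum(s["mean"] * s["count"] for s in non_empty) / total,
        "count": total,
    }


# Writer side: one small JSON document per chunk.
chunks = [[1.0, 4.0, float("nan"), 7.0], [10.0, 20.0]]
stats_files = [json.dumps(chunk_stats(c)) for c in chunks]

# Reader side: answer a whole-array question from the small files alone.
overall = combine_stats([json.loads(s) for s in stats_files])
```

Note that min, max, mean, and count combine cleanly across chunks, while statistics such as the median do not, so the set of stored values would limit which computations can skip reading the chunks.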
Advantages
min(x)<5 and max(x)>5)
Costs
Is this something that could work? Something that already exists? Something that would never be useful?
I have worked with Zarr for a while, but I haven't worked on Zarr yet, so any feedback is appreciated.