
Extension proposal: chunk statistics #319

Open
barbuz opened this issue Nov 14, 2024 · 6 comments

Comments

@barbuz

barbuz commented Nov 14, 2024

This is a draft idea for a possible Zarr extension, looking for feedback before putting more work into it.

The idea in short

For each chunk, save a separate JSON file containing statistics about the values in that chunk (e.g. minimum, maximum, mean, number of non-NaN values).
Readers can read this very small file instead of the chunk itself to perform some kinds of computation more efficiently.
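To make the idea concrete, here is a minimal sketch of what computing and serializing such a stats file could look like. The statistics chosen and the side-file naming are assumptions for illustration, not part of any existing Zarr spec:

```python
import json
import numpy as np

def chunk_stats(chunk: np.ndarray) -> dict:
    """Compute summary statistics for one chunk, ignoring NaNs."""
    valid = chunk[~np.isnan(chunk)]
    return {
        "min": float(valid.min()) if valid.size else None,
        "max": float(valid.max()) if valid.size else None,
        "mean": float(valid.mean()) if valid.size else None,
        "count": int(valid.size),  # number of non-NaN values
    }

chunk = np.array([[1.0, 2.0], [np.nan, 6.0]])
stats = chunk_stats(chunk)
# A writer could store this next to the chunk object, e.g. as
# "c/0/0.stats.json" (the naming convention here is purely illustrative).
payload = json.dumps(stats)
```

The resulting JSON is a few dozen bytes, consistent with the "less than 100 B per chunk" estimate below.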

Advantages

  • Efficient value-based queries (e.g. to build a mask of all points where variable x is larger than 5, you can check the chunk stats and only need to load the chunks where min(x) < 5 and max(x) > 5; chunks entirely below the threshold contribute nothing, and chunks entirely above it are all True without being read)
  • Efficient computation of mean/sum over multiple chunks
  • "quick preview" of large portions of data by loading the mean values for each chunk rather than the data itself
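The first two advantages can be sketched in a few lines. This assumes the hypothetical per-chunk stats dicts from above; the data values are made up for illustration:

```python
# Hypothetical per-chunk stats, one dict per chunk.
stats = [
    {"min": 0.0, "max": 3.0, "mean": 1.5, "count": 100},
    {"min": 2.0, "max": 9.0, "mean": 5.0, "count": 100},
    {"min": 6.0, "max": 8.0, "mean": 7.0, "count": 50},
]

threshold = 5.0

# Chunks that must actually be read to build the mask "x > 5": only those
# straddling the threshold. Chunk 0 is entirely below (all False), chunk 2
# entirely above (all True), so neither needs to be loaded.
to_load = [i for i, s in enumerate(stats)
           if s["min"] < threshold < s["max"]]

# Global mean without touching any chunk data: a count-weighted average of
# the per-chunk means.
total = sum(s["count"] for s in stats)
global_mean = sum(s["mean"] * s["count"] for s in stats) / total
```

Note the weighted mean is exact only because each chunk stores its non-NaN count alongside its mean.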

Costs

  • Storing an extra file for each chunk (though each file would be under 100 B, so the storage cost is low)
  • Overhead of querying more objects: even when this extension reduces the total number of bytes read, the number of files read is likely to increase (all the stats files plus some chunk files)
  • Added complexity in Zarr readers to support the extension (too much to be worth considering?)

Is this something that could work? Something that already exists? Something that would never be useful?

I have worked with Zarr for a while, but I haven't worked on Zarr itself yet, so any feedback is appreciated.

@d-v-b
Contributor

d-v-b commented Nov 14, 2024

This is a cool and useful idea, and people have proposed it before. The challenge for the exact thing you suggest here will be managing O(num_chunks) statistics files. For large Zarr arrays, we will be basically doubling the number of objects in storage. But maybe that's worth it?

Alternatively, if we view "chunk statistics" in the abstract as a tabular data structure, with 1 row per chunk, then there are lots of ways to store that data structure, some of which might not generate so many files.
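One hypothetical serialization of that "one row per chunk" table is a single consolidated JSON object rather than one file per chunk; all field names and the chunk-key encoding here are invented for illustration:

```python
import json

# A single consolidated stats table, one row per chunk.
stats_table = {
    "columns": ["min", "max", "mean", "count"],
    # keys are chunk grid indices encoded as "i.j" (illustrative convention)
    "rows": {
        "0.0": [0.0, 3.0, 1.5, 100],
        "0.1": [2.0, 9.0, 5.0, 100],
    },
}
blob = json.dumps(stats_table)

# A reader fetches one small object and reconstructs any chunk's stats:
table = json.loads(blob)
row = dict(zip(table["columns"], table["rows"]["0.1"]))
```

For very large arrays a columnar binary format would scale better than JSON, but the data structure is the same either way.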

If you have the time, maybe it would be worth hacking together a prototype? If that prototype proves useful to you, that's good evidence the idea could be useful to someone else, and we can worry about scaling details later.

@d-v-b
Contributor

d-v-b commented Nov 14, 2024

I think this idea dovetails with concepts used in tools like kerchunk and virtualizarr, where an explicit list of chunks (a "chunk manifest") is created and used for data retrieval, instead of relying on the discovered state of the storage backend. That same chunk manifest could perhaps be augmented with columns for chunk statistics?

cc @martindurant @TomNicholas

@martindurant
Member

Agreed, manifests are an obvious place to put this kind of thing: just one more value per chunk. This will increase the size of the manifests, but they ought to end up small regardless. The idea sort of suggests that manifests ought to be the only place to store extra per-chunk information, leaving the chunks themselves as encoded data only. (Noting that checksum codec implementations currently do store checksums in the chunk along with the data.)

How to make use of the metadata is another issue. We are talking about adding a "query plan" layer to what zarr currently does.

The related suggestion was for codecs whose parameters vary between chunks, e.g. a different scale/offset for each. That also requires updating the read pipeline, but in a different way.

@TomNicholas
Member

TomNicholas commented Nov 14, 2024

> This is a cool and useful idea, and people have proposed it before. The challenge for the exact thing you suggest here will be managing O(num_chunks) statistics files.

I agree - which is why I raised #305 to discuss the general problem of storing metadata that scales with O(num_chunks).

EDIT: Since that issue was raised, Icechunk has been released, which currently uses msgpack to store its manifests.

@jhamman
Member

jhamman commented Nov 14, 2024

Thanks @barbuz for opening this issue. It's a topic we've already been thinking about with Icechunk, and we've intentionally left room in the Chunk Manifest spec for such a feature. As @TomNicholas points out, we're currently using MessagePack for the manifests, but this is extensible. We could pick a different container for very large manifests in the future.

I also agree with @martindurant - we'll need to think carefully about how Zarr implementations would take advantage of such statistics.

@barbuz
Author

barbuz commented Nov 21, 2024

Thank you all for your feedback on this. I appreciate the points you made, which I can summarise in two main items:

How to manage O(num_chunks) metadata

The approach I envisioned is based on a separate file for each chunk because this would make it easy to do parallel writes and reads across several chunks, in line with the main philosophy of Zarr. I know that this is not the most compact way of storing this information, but maybe it won't be a big issue in practice? Some testing would be needed to be sure.
The manifest-based approaches used by some tools are great, but maybe they should be left as options for each tool to decide, while the base Zarr spec focuses on defining simple and interoperable ways to store the data.

How to use the statistics in queries

This, I believe, is the hard part. My naive idea would be for the Zarr libraries to expose access to these statistics, and for the data analysis libraries built on top of Zarr to use them to make their queries more efficient. I will admit, though, that I have mostly managed Zarr data through xarray, and I am not sure where the line is between what xarray does and what zarr-python does. I will need to look into this, but if anyone has a clear picture of how these interactions are structured, and an idea of how these chunk statistics could fit in, I would love to hear it!
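To make that layering question concrete, here is one hypothetical shape for it: the Zarr library exposes raw per-chunk stats, and a higher-level library supplies the predicates. None of these class or method names exist in zarr-python today; this is purely a sketch:

```python
# Hypothetical sketch: zarr-python exposes raw per-chunk stats; higher-level
# libraries (e.g. xarray) decide how to use them. All names here are invented.
class ChunkStatsIndex:
    def __init__(self, stats_by_chunk):
        # mapping from chunk grid index, e.g. (0, 1), to a stats dict
        self._stats = stats_by_chunk

    def get(self, chunk_index):
        """Return the stats dict for one chunk, or None if absent."""
        return self._stats.get(chunk_index)

    def chunks_where(self, predicate):
        """Indices of chunks whose stats satisfy a caller-supplied predicate."""
        return [idx for idx, s in self._stats.items() if predicate(s)]

index = ChunkStatsIndex({
    (0, 0): {"min": 0.0, "max": 3.0},
    (0, 1): {"min": 2.0, "max": 9.0},
})
# A query layer would ask which chunks might contain values above 5:
candidates = index.chunks_where(lambda s: s["max"] > 5.0)
```

The design choice here is that the Zarr layer stays policy-free: it only stores and serves the stats, while the query-planning logic lives entirely in whatever library sits on top.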
