-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added consolidated metadata to spec #309
base: main
Are you sure you want to change the base?
Added consolidated metadata to spec #309
Conversation
Going back to #136 (comment) (to ZEP or not to ZEP), I think this could potentially use some discussion around whether there is an opportunity to make the consolidated metadata one implementation of a wider interface. The most general abstraction I've been considering is "metadata loader" (similar to the storage transformer that was initially proposed for sharding). Basically, could we have extensible (i.e. you can write your own) ways of looking up metadata that would be registered in the zarr.json? I owe you a better description but to get them down on paper before I get inundated again, here are some similar requirements/features that I could see:
In my mind, if one of these alternative mechanisms is active, it could be the sole source of information, rather than duplicating the metadata. Some of these are more useful for user attributes rather than the entire metadata, and certainly several may be more difficult without the concept of the root metadata, but it feels like this is an opportunity for us to go beyond the original consolidated metadata workaround. |
Thanks Josh! I like the idea of metadata loading being an extension point. We'll want to hash out some details before merging this, since a top-level
We could have the name be I'll play with that a bit to see how it feels in zarr-python. |
Thanks for working on this! What do you think about multiple instances of metadata consolidation in the same zarr hieararchy? E.g., Group A contains Group B, Group B generates consolidated metadata, then Group A generates consolidated metadata, which ends up containing 2 models of the hierarchy rooted at Group B. Because the consolidated metadata is part of the Group metadata, then consolidating recursively will end up with a lot of repeated content. A few options:
The latter option seems best to me; curious to hear other perspectives. |
Coming back to @joshmoore's "metadata loader" concept, the thing we need to solve now is what sort of metadata to put in the Zarr Group object to indicate "here's how you find more stuff about this hierarchy". And I'll focus that a bit more to "listing files on blob storage is messy / slow, so I want to avoid that if possible". I'm going to make a couple of statements that I think are true:
It's that interface with the store that's tripping me up when trying to generalize this. Why do we need to standardize that in the Zarr Group object? By the time the user has loaded some zarr Group, they have already specified their connection to the metadata store, right? That's how they loaded the group in the first place. With something like the current proposal, we have a way for the In short, I think my claim is that the "metadata loader" concept is mostly orthogonal to the consolidated metadata concept. Consolidated metadata is all about reducing the number of trips to the "Store", by giving a place for a Group to hold information about its children. I think I'm comfortable that a top-level One thing that probably should be addressed now / soon is the idea of storing subsets of the metadata. I'd propose something like a
|
sorry for missing your questions @d-v-b.
I like the framing (which I think was yours originally?) of thinking about consolidated metadata as a cache. Since you can have arbitrarily nested Groups / Arrays, IMO you should be able to have arbitrary consolidated metadata at any level of the hierarchy and the spec shouldn't do anything to dissuade that. As for the question about potentially duplicating data, I think that we won't have that in zarr-python. Working through your example:
I think that we'll have "repeated content" across the files. But we won't have repeated content within a single Given an example like
The zarr.json for group
And the zarr.json for
A note that this might be different from how zarr v2 treated consolidated metadata. In zarr-developers/zarr-python#2113 (comment), I think I found that zarr v2 only supported consolidated metadata at the root of a Store, not nested paths within a store. I think this might be an area where we want to diverge from zarr v2. |
I think this should be about ready to go. @LDeakin I saw that you at least started on this at LDeakin/zarrs#55. Anything come up in your implementation that might adjust what you want to see in the spec? |
Nope, the spec looks good, and no issues popped up. Thanks for writing it. |
Sorry for the slow response, @TomAugspurger. To be clear lest it not come through here, I'm super excited to have CM going into the spec!
I definitely agree that this is a tricky (though critical) part of the bootstrapping. The use case I tend to be most concerned with is everything under
Can you help me see why it's orthogonal? My hope was that that would be one of the things this interface could allow a user to do.
Except then it's already done in one specific way, no? With the metadata, e.g., currently being duplicated, no?
Agreed that this is another "parameter" for the consolidation, and I imagine there will be more.
Not that I necessarily agree, but what are the remaining (related) steps from your point-of-view? From my side, I'd be for trying to encourage some more feedback from implementers (huge kudos to @LDeakin) |
Thanks @joshmoore.
Sorry, I don't quite understand this set of questions and the "some" / "no" info concepts. Do you have an example of what you have in mind?
To me it just goes back to the original motivation for consolidated metadata: A way for a Group in a Zarr hierarchy to know a bit about its children (to avoid potentially costly repeat trips to a remote Store for common scenarios like "load all the metadata for arrays in this Group / dataset"). How exactly you load that initial Group (your "metadata loader" concept) doesn't really bear on the question of "where do I store information about children", right?
Answered above, I think. All where specifying here is how to present information about your children; no constraints on how that information gets there. In particular, the spec takes no opinion on how the consolidated metadata is actually stored. A database presenting a Zarr API could construct consolidated metadata on the fly if it wanted (and so no duplication).
This is the only one I can imagine now. If you have any other in mind let me know. I'm not worried about making a
Nothing else. We're merging the zarr-python implementation today. Overall, I'd reiterate that this PR has a narrow scope: a standardized way for Groups know about their children. Effectively, a cache for what you'd get from listing nodes under some prefix, which we already know is needed for serving Zarr data over object storage in a reasonable time. |
Ok. Bear with me @TomAugspurger, I find myself unfortunately playing bad cop across the repo. First some immediate responses to your points and then I'll take a step back and try to explain.
Looking back, I think I misunderstood you. Let's hold off on this for the moment.
I thought it did, but you may have found a hole in the plan. Ultimately, I'm trying to find an underlying abstraction for what you've built. But more on that in a second.
I think this helps me see how the Store is becoming intwined. Maybe "MetadataLoader" is the wrong metaphor, but more "MetadataTransformer" to go with the "StorageTransformer". (If, however, CM can be achieved with the StorageTransformer, 👍) The difference in my mind is the loader/mapper/transformer is config to tell us what keys to look up (whether in databases, filesystems, or buckets) rather than the existing JSON objects.
except the current definition would still have 2 keys for the same information even if the storage and/or retrieval has been optimized, right?
This concerns me, more to that below.
And in fact, now it is merged. Congrats on that and all the work but I think to be fair I will ignore that for this discussion, because though a great validation of the work here it's the cart before the horse. Ok, finally, to back up:
I understand the goal to get CM in ASAP. However, this is introducing a key into the metadata that could be with us for many years. Spec work is often about looking beyond the immediate scope, especially if we are asking others to implement. So a few, partially overlapping strategies to try not to get in your way: Broadening the scopeThis is what I was trying above and with my suggestion (somewhere?!) of going for a ZEP. By identifying an abstraction that would cover more use cases and that we would all feel comfortable having in https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points I was hoping to we would have more buy-in for the change. (Obviously, that's a big ask.) Lowering riskThe addition of extensions to Zarr v3 is still pretty untested. What you are trying to do should be made possible: dev wants to hack out a new feature, implement it and move on. 👍 Your use of Preparing for changeIf we see the need for an evolution of the configuration, it might make sense to introduce versioning internally. |
This is not entirely true for
If the configuration changes, an implementation can support that without needing an explicit version key. |
That's up to whatever storage is providing the Zarr metadata. In the case of the consolidated metadata produced by zarr-python, which would typically be written to a plain file / object store, yes. But that hypothetical database could store it however it wants. Stepping back: why is duplicating the metadata a problem? That's fundamental to this. Maybe substitute "cache" for "duplicate" and see whether it still causes any concerns.
"Broadening the scope" and "Lowering the risk" seem pretty much directly in tension with each other :) I'll push back pretty strongly on broadening the scope; this is deliberately scoped to narrowly solve the "open group and immediately inspect its children" use case, and I think that's a good thing. As for lowing the risk, I think we point to zarr-python 2.x's consolidated metadata. It seems to have worked pretty well, and this spec is essentially identical (aside from moving it to the single
I think stac's extension mechanism if much nicer than what zarr v3 provides today. When I first started on consolidated metadata, I actually started on the extension mechanism. JSON schemas for the core metadata plus a top-level
I think that would be solved by the zarr spec having a better extension mechanism (previous point) or by the |
General 💯 for thinking through the future-proofing with you guys, so super brief comments for 00:00 CET:
My gut reaction is to say we're putting too much on the back of
Largely, but you could of course draft a broader-scoped extension externally first, too ;)
But not across implementations and I have the feeling with a good deal of heuristics in consuming libraries like xarray to get the creation/detection in place. That a way to tune the metadata loading for different cases is a given, but I would hope we could also improve on it.
nods certainly something that doesn't give the sense that we might need a series of 3.x verisons.
If we think we can sufficiently lower the risk then agreed, but as currently written my instinct is we're not there yet (though perhaps @LDeakin's suggestion will find traction).
❤️ |
Maybe, but I try not to get too caught up in hypotheticals. We have a very real use case where the layout of Zarr metadata on a remote file system causes issues. So we have a tailored solution for that. Why do we need to complicate it to handle other stuff (stuff that at least I don't have a clear grasp on)? If we come up with something better that does... something related to metadata loading, then great. Let's add that to the spec. And it can use And just to reiterate again: this is a small change. A change to the core metadata, sure, but it's completely optional. And it really isn't inventing any new concepts, both because the use-case was discovered (and solved!) by users of zarr-python, and because all of the complicated objects going in the If that's not enough de-risking, then can you lay out exactly what you're looking for? |
#316 has stuff on the extension side. Happy to discuss more there, but I don't think the limitations of Zarr's extension story should bear on this PR. |
The reason I think we're in this discussion is that as things stand this adds a feature (i.e., if we were under semver a minor release). The two paths I see are:
Let me try to give one scenario of it's impact:
This is just a part of caches being hard but I think previously was less of an issue since:
Users were then responsible for cleaning things. But that no longer holds, does it? |
I avoid semver conversations as a matter of course (aside from linking to https://hynek.me/articles/semver-will-not-save-you/), and I don't know how Zarr versions its spec, but the key thing here is this is backwards compatible. https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points does mention that top-level keys can be added with
We've got two implementations from two very different languages, plus the experience of zarr-python 2.x. I think that's sufficient.
You don't need to bring in multiple implementations to hit that: you can get in that situation perfectly well using just zarr-python. And I think that's OK. I don't think the spec can or should have much to say about how objects are actually stored, or how to synchronize concurrent changes among multiple objects. That should be left to the implementations to explore. Looking through https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#id21 and https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#storage it seems like the spec is restrained with respect to Stores and Storage. Let's keep it that way.
That's again up to the library. In zarr-python we default to using it if present, and provide APIs for explicitly using it or not. |
Ha! Probably a good idea, and I don't think it's a natural fit for a spec anyway, but major/minor/patch is a distinction that at least most people understand to simplify discussing the scope of changes.
(Per the zarr-python meeting) versioning is definitely an issue, but regardless, as things stand we also have to consider the forwards compatibility.
It's not wrong for what's written in the spec, but I wonder if from the zarr-python conversation just now changed your mind here at all.
For an extension (with a working extension mechanism) I'd call that more than fair! As things stand, I think I'm certainly still struggling with the fact that merging this is a minor release and so don't consider it sufficient. (Hence my call for a ZEP from the beginning.)
I disagree especially if it's happening silently.
I see what you're saying, but this is why for such a core feature I was so intent on avoiding the need for synchronization! |
What exactly is the forwards compatibility issue here?
I don't think so, but what do you have in mind?
How do you do that non-silently? The entire point of this consolidated metadata is to avoid looking at other objects in the hierarchy for static file systems. Regardless, this is surely, clearly, obviously out of scope for the spec, right? Why would it need to take a stance on how the underlying file system operates? Leave that up to the implementations. I get the feeling that we have wildly different priors on the purpose and scope of this change. All I'm proposing is a spot for Groups to hold information about their children. Issue like how to synchronize multiple writes already affect any system built on top of Zarr using a distributed file store without support for transactions spanning multiple objects; it's not unique to consolidated metadata. Use |
:) Not on this change specifically, @TomAugspurger, it's unfortunately more pervasive. That was my point about the call: we talked through how the immutability of the spec, lack of versioning, and untested extension mechanism makes even scoped changes nerve-wracking, and I've made the same mistake here as a I did in #312 to not clearly spell that out upfront -- apologies. You're coming into a loaded situation and I was trying to help by doing what I would have with a ZEP. I still think my points (abstract interface, versioning, dealing with stale data) would ultimately make this a more robust (hopefully valuable 😄) addition, but as @rabernat describes in #312, what you are looking for at the moment seems to be a proper extension mechanism. |
The changes here use the extension and change mechanism we have. Let's use it!
I still don't have a great understanding of what exactly that proposal is, but I think #309 (comment) (from a month ago!) still has my thoughts: that seems separate and can use
I'm surprised you raised that again... I assumed it was something you raised earlier because it was a fleeting "this would be nice to have". I had that thought too. But once you dig into the implementation, it should become clear that's not really feasible for all storage systems. I can go into more detail about it if you want, but:
As an analogy: S3 didn't even provide strong read after write consistency till a few years ago. If we wanted the spec to take that strong of a stance on storage like "you should be able to read data immediately after writing it", we wouldn't have Zarr in the first place. Let the Zarr spec focus on the core components of the spec like where to place information, what information to put there; let the implementations handle the details of how that actually works in practice. And now that I'm going why stop: "stale" metadata has been valuable for zarr v2. That offers a way for data providers to update the metadata for many nodes in a hierarchy atomically.
I am so not looking for that. I just want ensure that workflows using xarray and can use Zarr. I don't particularly care about Zarr itself.
I'm probably getting short, and this aside was probably an attempt to lighten the mood, but do you think the value of consolidated metadata questionable at this stage? After all the experience with zarr V2? I'll try to summarize how I see things:
So, given all that, how do we move forward here? |
Following #312 (comment), I'm hesitant to try my standard strategy so I'll try to be blunt because it seems I haven't been communicating effectively: I'm strongly of the opinion this should be a ZEP in order to gain wider endorsement and review, or (for the moment) should be a true extension not defined in the core metadata.
Sorry, it wasn't fleeting, no. Minimally, for example, I'd spell out the semantics and expectations for an ideal storage system.
Agreed. And even there, purely at the information level, I see the potential for corruption. (FWIW, I've pondered through defrag and invalidation like steps, but none of them are really tractable.) I'd be much more comfortable if CM owned the metadata, i.e. it's less a cache and a choice that you make when saving, e.g., read-only data, because you know your access patterns.
I appreciate that you are trying to fix an upstream library, but that is a large part of my concern that the larger picture, including other implementations, is not being taken into account.
Apologies, yes. That was an attempt at levity and to be clear the "more" applied to both, i.e. "make this a more robust and hopefully more valuable as well. I know how important consolidated metadata is; there's data waiting to be converted; libraries waiting to update; science waiting to be done. But being blunt again, it's initial implementation was implemented "without particular care to Zarr", and that's something I've struggled with in my other ecosystems. So perhaps I've being overly cautious, but I'm concerned that I can't represent the voices of all the other implementations (@LDeakin, thanks for joining in!) and that's the real purpose of ZEPs -- to make sure that you are taking the wider community along with you. |
FWIW, as an outsider the state of https://github.com/zarr-developers/zeps doesn't give me a ton of confidence that's it's a functioning system. That impression, plus the sentence in the spec saying that the metadata can't contain other fields (before I understood what this
Maybe do that for the existing spec, and we can incorporate it here? There won't be anything unique to consolidated metadata.
That sounds like a separate proposal.
As long as we're discussing the larger picture: keep in mind that this experience hasn't been especially friendly to a newcomer. I'm used to developing open-source projects and establishing consensus, so it's fine. But I could easily see other newcomers being pushed away, given a similar experience.
And again, this change is compatible with existing implementations. Who are the other implementers to reach out to? I see teams for |
My time to respond now is unfortunately limited since I'm getting to get on a bike for a week, so I'll focus on the most critical points:
Definitely a lot of that at the moment. See zarr-developers/governance#44 for more.
Very fair feedback.
I would hope that would have been more obvious starting from https://zarr.dev/zeps/active/ZEP0000.html |
Thanks, enjoy! |
This PR adds a new optional field to the core metadata for consolidating all the metadata of all child nodes under a single object. The motivation is similar to consolidated metadata in Zarr V2: without consolidated metadata, the time to load metadata for an entire hierarchy scales linearly with the number of nodes. This can be costly, especially for large hierarchies served HTTP from a remote storage system (like Blob Storage).
The primary points to discuss:
zarr.json
(kind="inline
).For *very* large hierarchies, this will bloat the size of the root
zarr.json`, slowing down operations that just want to open the metadata for the root.zarr.json
. That might be a nice alternative to consolidated metadata for very large hierarchies (you could do an initial read to get the list of nodes, and the load the metadata for each child concurrently).Closes #136
xref zarr-developers/zarr-python#2113, which is a completed implementation in zarr-python and LDeakin/zarrs#55, a draft implementation in zarrrs..