Add spec document #1
Conversation
The goals of the specification are as follows:

1. **Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a dataset. Writes across multiple arrays and chunks will be committed via a single atomic operation and will not be partially visible. Readers will not acquire locks.
2. **Chunk sharding and references** - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced.
A fast data move operation should be mentioned, either as part of goal 2 or as a new goal 3.
- Time travel
Can you clarify what you mean by "data move"? Do you mean renaming groups and arrays?
The best analogy there with Iceberg might be "schema evolution", i.e. adding, deleting, renaming columns. Do you think that's a useful concept to introduce?
Yes, sorry, I mean renaming of both groups and arrays without the need to copy the chunks. I like the analogy with schema evolution because both are done mostly for the same reasons. You start with a model, it evolves, and you learn your hierarchy would be better in some other form.
IIUC Seba has consistently made the point that "read performance is prioritized over write performance." Is that right? If so, should it be mentioned here?
read performance is prioritized over write performance
I'm not sure I understand how that constraint has impacted our design. AFAICT Icechunk write performance is no worse than vanilla Zarr, modulo the overhead of managing the manifests etc. (which also applies to reads).
We'll have a write performance impact, driven by the need for transactionality. I think this is expected, a price paid to get the feature. The cost will be roughly linear in the size of the write, but with a small constant. For example, we'll need to rewrite manifests when adding changes.
path: string not null
-- field metadata --
description: 'path to the node within the store'
array_metadata: struct<shape: list<item: uint16> not null, data_type: string not null, fill_value: binary, dimension_names: list<item: string>, chunk_grid: struct<name: string not null, configuration: struct<chunk_shape: list<item: uint16> not null> not null>, chunk_key_encoding: struct<name: string not null, configuration: struct<separator: string not null> not null>, codecs: list<item: struct<name: string not null, configuration: binary>>>
For all array indexes we are currently using uint64. For the fill_value we'll use a Union array. For the chunk_grid, the current implementation ignores it and uses just a normal uniform grid. Similarly for the chunk key encoding: we currently record only one of the two values possible in Zarr.
uint64 makes sense for array indexes. But not for chunk indexes; the chunk grid is orders of magnitude smaller than the actual array grid. Given that we have to store a lot of chunk indexes in this spec, I feel like we should try to use the smallest thing possible. But maybe that's premature optimization.
For the chunk_grid, the current implementation is ignoring it and using just a normal uniform grid.
I agree that makes sense from an implementation point of view. From a spec point of view, I think it makes sense to store the actual official Zarr V3 metadata directly, rather than designing a new schema and having to convert between them. Are you ok with that?
We should call out that we are only intending to support Zarr v3 stores / metadata at this point. Else we need to make the structure file more flexible than is currently shown (or a separate v2 structure file).
uint64 makes sense for array indexes. But not for chunk indexes
kind of?
Think height=1 pancakes, in which the array's time grid and chunk grid go in parallel.
Regarding how to store the zarr array metadata ... we can discuss more about it. Initially, I'm storing the columns explicitly, not as a blob, I thought the advantages were enough to justify the extra work. But it's close.
Seba, I think it makes sense to iterate on the details of these parts of the schema as you are implementing. We won't know the "right" choice until we actually code it up.
Absolutely @rabernat! That's the approach I'm taking: implementing the more basic encodings now, and iterating later once things work. This should be true for everything in this spec.
The latter requirement rules out structs or lists.

So we opt for a variable length binary encoding.
The chunk coord is created by encoding each element of the tuple as a big endian `uint16` and then simply concatenating the bytes together in order. For the common case of arrays <= 4 dimensions, this would use 8 bytes or less per chunk coord.
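For illustration, here is a minimal sketch of that encoding; the function name is hypothetical and not part of the spec:

```python
import struct

def encode_chunk_coord(coord: tuple[int, ...]) -> bytes:
    """Encode each coord element as a big-endian uint16 and concatenate the bytes."""
    return b"".join(struct.pack(">H", c) for c in coord)

# (5, 30, 10) encodes to 6 bytes
assert encode_chunk_coord((5, 30, 10)) == b"\x00\x05\x00\x1e\x00\x0a"
```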
uint16 can only hold < 10 years of an hourly dataset.
Here's a different approach we could take to the coord index.
Rather than encoding the coord into a single field, we could define multiple optional int fields to use as a multiindex, i.e.
array_id: uint32 not null
index0: uint32 not null
index1: uint32
index2: uint32
index3: uint32
index4: uint32
index5: uint32
index6: uint32
index7: uint32
inline_data: binary
chunk_id: binary
virtual_path: string
offset: uint64
length: uint32 not null
This would put a constraint on how many dimensions the chunk grid could support.
Some tests using pyarrow suggest that this is about 4x slower in terms of indexing than the binary encoding.
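For concreteness, a quick pyarrow sketch of that multi-index layout (illustrative only; field names follow the list above):

```python
import pyarrow as pa

# Multi-index chunk table: index0 is required, index1..index7 are optional,
# which caps the chunk grid at 8 dimensions.
chunk_table_schema = pa.schema(
    [
        pa.field("array_id", pa.uint32(), nullable=False),
        pa.field("index0", pa.uint32(), nullable=False),
        *[pa.field(f"index{i}", pa.uint32()) for i in range(1, 8)],
        pa.field("inline_data", pa.binary()),
        pa.field("chunk_id", pa.binary()),
        pa.field("virtual_path", pa.string()),
        pa.field("offset", pa.uint64()),
        pa.field("length", pa.uint32(), nullable=False),
    ]
)
```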
@rabernat, answers to some of your comments/questions:
I think it's a useful level of indirection. It gives us more freedom for things like expiration.
One minor issue is we don't have a dedicated place for that kind of thing. We would have to repeat this info in each row that links to the file, which I suppose is OK if we are careful about using dictionary encoding or something like that.
Thanks for all the feedback. Looking at my calendar, I expect to have time to respond and make a revision on Wednesday.
I see what you mean. This is a consequence of using a denormalized schema. A more normalized schema would have a separate table for every single entity. We could use a SQLite file if we wanted to go that route.
This was my first complete read of this document so I added comments throughout the entire thing. Most important right now is my question about the catalog and dataset discovery.
spec/icechunk_spec.md
This specification describes a single Icechunk **dataset**.
A dataset is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently.
Can we just replace `dataset` with `store`? If that is what we mean, then I don't understand why we must add a layer of indirection from the start.
Yes, in my mind `store` and `dataset` are equivalent. So you're saying you think we should just use `store`? Fine with me.
Yes, that is what I'm saying.
Yeah I'd find that less confusing. I would also quote the Zarr definition of "store" to be quite explicit here.
#### Chunk Coord Encoding

Chunk coords are tuples of positive ints (e.g. `(5, 30, 10)`).
A few nits here:
a) remember that 0 is valid here
b) there has been talk of adding negative chunk coords to allow left appends - zarr-developers/zarr-specs#9
```
def delete_dataset(dataset_identifier) -> None:
    """Remove a dataset from the catalog."""
    ...
```
Does the catalog support listing datasets, or any other discovery mechanisms?
Right now I'm putting that in the category of "things that could be implemented by a catalog but aren't strictly required".
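If a catalog did want to offer discovery, it could add something like the sketch below (purely hypothetical; the spec does not require it):

```python
def list_datasets() -> list[str]:
    """Optional: return the identifiers of all datasets known to the catalog."""
    ...
```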
| id | YES | str UID | A unique identifier for the dataset |
| generation | YES | int | An integer which must be incremented whenever the state file is updated |
| store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. |
| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
Suggested change:
| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
| snapshots | YES | array[Mapping] | A list of all of the snapshots (described below) |
I'm not sure we want just snapshot IDs. There is potentially other metadata in this data structure (timestamp, parent snapshot, etc.)
Ah my bad. I was confused by what `snapshot` meant. I guess it's a Mapping, so this is `array[Mapping]`?
Co-authored-by: Deepak Cherian <[email protected]>
child 1, row: uint16 not null
child 2, flags: uint16
Can we describe these two parameters, please? I don't understand what they mean.
These come directly from Seba's Icechunk V2 Notes. My understanding is that `row` is a pointer to the row in the attrs file where entries from this node start, while `flags` is a customizable extension point. Maybe @paraseba can elaborate more?
We want to know where to search for the metadata in a big attributes file, so we store the row to index into it in O(1). Flags is a way to know what to expect of a file before trying to parse it, so every time we have a pointer, we include a flags field that could tell you something like "this is the parquet format version 0.3, with support for X". I'm not loving this because the field is repeated in every row, which takes memory, but we can be cautious with the encoding. I'll think more about it once we get to implementing these more detailed features.
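A rough sketch of that O(1) positional lookup using pyarrow (the attrs file name and exact layout are assumptions for illustration):

```python
import pyarrow.parquet as pq

# `row` would come from the structure file's `row` column for the node of interest.
row = 42  # hypothetical value
attrs_table = pq.read_table("attrs-0001.parquet")  # hypothetical attrs file name
node_attrs = attrs_table.slice(row, 1)  # positional slice: no scan of the attrs table
```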
| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
| refs | NO | mapping[reference] | A mapping of references (string names) to snapshots |

A snapshot contains the following properties
Suggested change:
A snapshot contains the following properties
A snapshot is a Mapping that contains the following properties
What is a "Mapping"? Do you mean like a dict? This is JSON, so I think the correct entity type would be an Object.
Here is my attempt to synthesize the original Figma proposal with @paraseba's design notes.
In writing this up, of course I had more ideas for how to improve things. So I am proposing a few more changes, which I summarize below. I know this may be disruptive now that implementation has already started.
Proposed Changes
"Catalog" has been renamed to "State File"
State file now contains two arrays: snapshots and refs.
This is equivalent to what is supported by the iceberg REST catalog.
Structure file uses a much more nested schema than what @paraseba has proposed. This heavily leverages Arrow / Parquet nested structs and lists. Specifically, all of the array metadata lives within a big structure which is just null for groups. This felt like a much cleaner way to store both arrays and groups in the same table. I'm looking for detailed feedback on these schemas, particularly in light of the implementation work that has already been done.
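As a toy illustration of the "struct that is null for groups" idea (a simplified schema, not the real one):

```python
import pyarrow as pa

# Toy version of the structure-file idea: one row per node, with an
# `array_metadata` struct that is populated for arrays and null for groups.
node_schema = pa.schema(
    [
        pa.field("path", pa.string(), nullable=False),
        pa.field(
            "array_metadata",
            pa.struct(
                [
                    pa.field("shape", pa.list_(pa.uint64())),
                    pa.field("data_type", pa.string()),
                ]
            ),
        ),
    ]
)

table = pa.table(
    {
        "path": ["/group1", "/group1/array1"],
        "array_metadata": [None, {"shape": [100, 200], "data_type": "float32"}],
    },
    schema=node_schema,
)
```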
I'm wondering if we can get away without a separate attrs file for now. This adds a lot of complexity without much immediate benefit.
Questions.
TODO