
Add spec document #1

Merged · 13 commits into main from ryan/spec-draft · Aug 9, 2024
Conversation

@rabernat (Contributor) commented Aug 4, 2024:

Here is my attempt to synthesize the original Figma proposal with @paraseba's design notes.

In writing this up, of course I had more ideas for how to improve things. So I am proposing a few more changes, which I summarize below. I know this may be disruptive now that implementation has already started.

Proposed Changes

  1. "Catalog" has been renamed to "State File"

  2. State file now contains two arrays:

    1. Snapshots - a list of all snapshots that have not been expired
    2. References - tags and branches

    This is equivalent to what is supported by the Iceberg REST catalog. (A sketch of a possible state file follows this list.)

  3. Structure file uses a much more nested schema than what @paraseba has proposed. This heavily leverages Arrow / Parquet nested structs and lists. Specifically, all of the array metadata lives within a big structure which is just null for groups. This felt like a much cleaner way to store both arrays and groups in the same table. I'm looking for detailed feedback on these schemas, particularly in light of the implementation work that has already been done.

  4. I'm wondering if we can get away without a separate attrs file for now. This adds a lot of complexity without much immediate benefit.
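
For concreteness, here is a sketch of what a state file under this proposal might look like (all field names and values are illustrative, not final):

```json
{
  "id": "9e7c2a4f",
  "generation": 7,
  "snapshots": [
    {
      "snapshot_id": "a1b2c3",
      "parent_snapshot_id": null,
      "timestamp": "2024-08-01T12:00:00Z"
    },
    {
      "snapshot_id": "d4e5f6",
      "parent_snapshot_id": "a1b2c3",
      "timestamp": "2024-08-04T09:44:00Z"
    }
  ],
  "refs": {
    "main": "d4e5f6",
    "v1.0": "a1b2c3"
  }
}
```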

Questions

  1. Should the state file explicitly store the file name of the structure file, or is this implicit in the snapshot ID? Similarly...
  2. Should the structure file explicitly store the file name of the manifest / inventory file, or is it enough to store an ID?
  3. In the structure file reference to the manifest, do we need to store a row range (i_row_start, i_row_stop) rather than just a single row (i_row)? Won’t this help us subset the data better?
  4. In the structure file, should we store the size of the manifest file? That will help us know whether we should just eagerly download the whole thing or try to use some smarter strategy.
  5. In the manifest file, what is chunk_id? Is this the chunk filename? Can we call it chunk_file instead? Why does a chunk need an ID?
  6. What is virtual_path? Is it mutually exclusive with chunk_id?
  7. Do we need to store the array ID in the chunk manifest in order to be able to find it? Otherwise the information about which array a chunk belongs to is not self-contained within the manifest file but instead requires the structure file's i_row to disambiguate between multiple chunks with the same coord.
  8. Can we decide on either "inventory" or "manifest" and stick to it? I slightly favor "manifest" because this is already in use within the community, but I don't care deeply.

TODO

  • decide how to handle Enum types (e.g. group / array). Pyarrow doesn't support enum types, which prevented me from using them in this schema. But there's no reason we couldn't use them from Rust.
  • decide on encoding (e.g. JSON, msgpack, CBOR) of arbitrary unstructured data within Parquet files. This is needed for user attrs and also codec configuration.
  • document key algorithms
  • improve "high level goals" section
  • convert JSON tables to proper JSON schema

@rabernat rabernat requested review from paraseba and dcherian August 4, 2024 09:44
The goals of the specification are as follows:

1. **Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a dataset. Writes across multiple arrays and chunks will be committed via a single atomic operation and will not be partially visible. Readers will not acquire locks.
2. **Chunk sharding and references** - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced.
Contributor:

A fast data move operation should be mentioned, as part of 2 or as a new 3.

  1. Time travel

@rabernat (author) replied:

Can you clarify what you mean by "data move"? Do you mean renaming groups and arrays?

The best analogy there with iceberg might be "schema evolution", i.e. adding, deleting, renaming columns. Do you think that's a useful concept to introduce?

Contributor:

Yes, sorry, I mean renaming of both groups and arrays without the need to copy the chunks. I like the analogy with schema evolution because both are done mostly for the same reasons. You start with a model, it evolves, you learn your hierarchy would be better in some other form.

@dcherian (Contributor) commented Aug 8, 2024:

IIUC Seba has consistently made the point that "read performance is prioritized over write performance." Is that right? If so, should it be mentioned here?

@rabernat (author) replied:

> read performance is prioritized over write performance

I'm not sure I understand how that constraint has impacted our design. AFAICT Icechunk write performance is no worse than vanilla Zarr, modulo the overhead of managing the manifests etc. (which also applies to reads).

Contributor:

We'll have a write performance impact, given by the need for transactionality. This is expected I think, a price paid to get the feature. This cost will be somewhat linear with the size of the write, but with a small constant. For example, we'll need to rewrite manifests to add changes.

```
path: string not null
  -- field metadata --
  description: 'path to the node within the store'
array_metadata: struct<
    shape: list<item: uint16> not null,
    data_type: string not null,
    fill_value: binary,
    dimension_names: list<item: string>,
    chunk_grid: struct<name: string not null,
                       configuration: struct<chunk_shape: list<item: uint16> not null> not null>,
    chunk_key_encoding: struct<name: string not null,
                               configuration: struct<separator: string not null> not null>,
    codecs: list<item: struct<name: string not null, configuration: binary>>
  >
```
Contributor:

For all array indexes we are currently using uint64. For the fill_value we'll use a Union array. For the chunk_grid, the current implementation is ignoring it and using just a normal uniform grid. Similarly for the chunk key encoding: we currently record only one of the two values possible in Zarr.

@rabernat (author) replied:

uint64 makes sense for array indexes. But not for chunk indexes; the chunk grid is orders of magnitude smaller than the actual array grid. Given that we have to store a lot of chunk indexes in this spec, I feel like we should try to use the smallest thing possible. But maybe that's premature optimization.

> For the chunk_grid, the current implementation is ignoring it and using just a normal uniform grid.

I agree that makes sense from an implementation point of view. From a spec point of view, I think it makes sense to store the actual official Zarr V3 metadata directly, rather than designing a new schema and having to convert between them. Are you ok with that?

Member:

We should call out that we are only intending to support Zarr v3 stores / metadata at this point. Else we need to make the structure file more flexible than is currently shown (or a separate v2 structure file).

@paraseba (Contributor) commented Aug 8, 2024:

> uint64 makes sense for array indexes. But not for chunk indexes

Kind of? Think height=1 pancakes, in which the array's time grid and chunk grid go in parallel.

Regarding how to store the zarr array metadata, we can discuss more about it. Initially I'm storing the columns explicitly, not as a blob; I thought the advantages were enough to justify the extra work. But it's close.

@rabernat (author) replied:

Seba, I think it makes sense to iterate on the details of these parts of the schema as you are implementing. We won't know the "right" choice until we actually code it up.

Contributor:

Absolutely @rabernat! That's the approach I'm taking: implementing the more basic encodings now, and iterating later once things work. This should be true for everything in this spec.

The latter requirement rules out structs or lists.

So we opt for a variable length binary encoding.
The chunk coord is created by encoding each element of the tuple as a big-endian `uint16` and then simply concatenating the bytes together in order. For the common case of arrays of <= 4 dimensions, this would use 8 bytes or less per chunk coord.
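
As a sketch of this encoding (illustrative only, not the official implementation):

```python
import struct

def encode_chunk_coord(coord: tuple[int, ...]) -> bytes:
    """Concatenate each element of the tuple as a big-endian uint16.

    Minimal sketch of the encoding described above; assumes every
    index fits in a uint16 (struct raises an error otherwise).
    """
    return b"".join(struct.pack(">H", i) for i in coord)

# e.g. (5, 30, 10) -> b"\x00\x05\x00\x1e\x00\x0a" (6 bytes for 3 dims)
assert encode_chunk_coord((5, 30, 10)) == b"\x00\x05\x00\x1e\x00\x0a"
```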
Contributor:

uint16 can only hold < 10 years of an hourly dataset (2^16 = 65,536 hourly steps ≈ 7.5 years, assuming one chunk per step).

@rabernat (author) replied:

Here's a different approach we could take to the coord index.

Rather than encoding the coord into a single field, we could define multiple optional int fields to use as a multi-index, i.e.

```
array_id: uint32 not null
index0: uint32 not null
index1: uint32
index2: uint32
index3: uint32
index4: uint32
index5: uint32
index6: uint32
index7: uint32
inline_data: binary
chunk_id: binary
virtual_path: string
offset: uint64
length: uint32 not null
```

This would put a constraint on how many dimensions the chunk grid could support.

Some tests using pyarrow suggest that this is about 4x slower in terms of indexing than the binary encoding.
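
For context, a rough sketch (not the actual benchmark code; data shapes and sizes assumed) of how such a pyarrow comparison might be set up:

```python
import time
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# Look up one chunk coord among a million rows, binary-encoded vs multi-column.
n = 1_000_000
rng = np.random.default_rng(0)
coords = rng.integers(0, 1000, size=(n, 3), dtype=np.uint16)

# Binary encoding: big-endian uint16s concatenated per row
binary = pa.table({"coord": pa.array([c.astype(">u2").tobytes() for c in coords])})

# Multi-index encoding: one uint32 column per dimension
multi = pa.table({f"index{i}": coords[:, i].astype(np.uint32) for i in range(3)})

target = coords[n // 2]

t0 = time.perf_counter()
binary.filter(pc.equal(binary["coord"], target.astype(">u2").tobytes()))
print("binary:", time.perf_counter() - t0)

t0 = time.perf_counter()
mask = pc.equal(multi["index0"], int(target[0]))
for i in (1, 2):
    mask = pc.and_(mask, pc.equal(multi[f"index{i}"], int(target[i])))
multi.filter(mask)
print("multi-index:", time.perf_counter() - t0)
```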

@paraseba (Contributor) commented Aug 5, 2024:

@rabernat, answers to some of your comments/questions:

  • for the nesting of the columns, yes. I'm experimenting with that because the library has rough edges. Some of the columns (like the ones for manifests) I'm putting nested into structures. It's going OK, but it's not any less work than fully expanding. I still think it's better to nest, because the resulting file is more compact, but there is no real advantage in code, storage, or anything. Just syntax sugar.
  • We are not going to implement attribute files in the first iteration, but I think they are important because they make for a better spec. Some use cases could call for very large user attributes, queried very infrequently, so configuring the writer to have very small inline metadata could be useful. I'd keep them in the spec for now, and see how it goes.
  • Should the catalog explicitly store the file name of the structure file

I think it's a useful level of indirection. It gives us more freedom for things like expiration.

  • Should the structure file explicitly store the file name of the manifest / inventory file, or is it enough to store an ID?

    ID is enough

  • In the structure file, should we store the size of the manifest file? That will help us know whether we should just eagerly download the whole thing or try to use some smarter strategy.

    Good idea, this is the kind of thing I was reserving the "extra" field for, but this one feels like an optimization that the open source client can make so it deserves its own column. We'll add it in a second pass to the implementation.

  • In the manifest file, what is chunk_id? Is this the chunk filename? Can we call it chunk_file instead? Why does a chunk need an ID?

    see my comment in your PR

  • What is virtual path? Is it mutually exclusive with chunk_id?

    for virtual chunks we need the full path; for materialized chunks a short binary ID is enough (see the sketch after this list)

  • Do we need to store the array ID in the chunk manifest

    Yes, it's what I call id in the design notes; array_id would be a better name

  • Can we decide on either "inventory" or "manifest" and stick to it?

    Yes, manifest it is. I only switched to inventory temporarily because I was using metadata instead of user attributes and had a conflict over the initial letter.
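
To make the chunk_id / virtual_path distinction concrete, here is a hypothetical pair of manifest rows (field names and values illustrative, not fixed by the spec):

```python
# Hypothetical manifest rows; all values are illustrative only.
materialized_chunk = {
    "array_id": 42,
    "coord": b"\x00\x05\x00\x1e",      # binary-encoded chunk coord (5, 30)
    "chunk_id": b"\xab\xcd\xef\x01",   # short binary ID of a native chunk object
    "virtual_path": None,              # unused for materialized chunks
    "offset": 0,
    "length": 4096,
}
virtual_chunk = {
    "array_id": 42,
    "coord": b"\x00\x06\x00\x1e",
    "chunk_id": None,                          # unused for virtual chunks
    "virtual_path": "s3://some-bucket/data.nc",  # full path to an external file
    "offset": 1_048_576,                       # byte range within that file
    "length": 4096,
}
```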

@paraseba (Contributor) commented Aug 5, 2024:

> In the structure file, should we store the size of the manifest file?

One minor issue is we don't have a dedicated place for that kind of thing. We would have to repeat this info in each row that links to the file. Which I suppose is OK if we are careful about using dictionary encoding or something like that.

@rabernat (author) commented Aug 5, 2024:

Thanks for all the feedback. Looking at my calendar, I expect to have time to respond and make a revision on Wednesday.

@rabernat (author) commented Aug 8, 2024:

> We would have to repeat this info in each row that links to the file. Which I suppose is OK if we are careful about using dictionary encoding or something like that.

I see what you mean. This is a consequence of using a denormalized schema. A more normalized schema would have a separate table for every single entity.

We could use a sqlite file if we wanted to go that route.
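
For reference, a minimal pyarrow sketch (illustrative values) of why the repetition stays cheap under dictionary encoding, as suggested above:

```python
import pyarrow as pa

# Illustrative only: a per-file value (e.g. manifest size) repeated in every
# row that links to that file. Dictionary encoding stores each distinct value
# once plus small integer indices, so the repetition costs little.
sizes = pa.array([204_800] * 1_000_000, type=pa.uint64())
encoded = sizes.dictionary_encode()
print(encoded.type)             # dictionary<values=uint64, indices=int32, ordered=0>
print(len(encoded.dictionary))  # 1 distinct value stored
```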

@jhamman (Member) left a review:

This was my first complete read of this document so I added comments throughout the entire thing. Most important right now is my question about the catalog and dataset discovery.

Comment on lines 6 to 7
This specification describes a single Icechunk **dataset**.
A dataset is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently.
Member:

Can we just replace dataset with store? If that is what we mean, then I don't understand why we must add a layer of indirection from the start.

@rabernat (author) replied:

Yes, in my mind store and dataset are equivalent. So you're saying you think we should just use store? Fine with me.

Member:

Yes, that is what I'm saying.

Contributor:

Yeah I'd find that less confusing. I would also quote the Zarr definition of "store" to be quite explicit here.


#### Chunk Coord Encoding

Chunk coords are tuples of positive ints (e.g. `(5, 30, 10)`).
Member:

A few nits here:

a) remember that 0 is valid here
b) there has been talk of adding negative chunk coords to allow left appends - zarr-developers/zarr-specs#9

```python
def delete_dataset(dataset_identifier) -> None:
    """Remove a dataset from the catalog."""
    ...
```
Member:

does the catalog support listing datasets? Or any other discovery mechanisms?

@rabernat (author) replied:

Right now I'm putting that in the category of "things that could be implemented by a catalog but aren't strictly required"
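
For illustration, such an optional discovery method might look like this (hypothetical; not part of the required catalog interface):

```python
def list_datasets() -> list[str]:
    """Return identifiers for all datasets known to this catalog.

    Hypothetical optional extension; not required by the spec.
    """
    ...
```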

| id | YES | str UID | A unique identifier for the dataset |
| generation | YES | int | An integer which must be incremented whenever the state file is updated |
| store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. |
| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
@dcherian (Contributor) commented Aug 8, 2024:

Suggested change:
```diff
-| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
+| snapshots | YES | array[Mapping] | A list of all of the snapshots (described below) |
```

@rabernat (author) replied:

I'm not sure we want just snapshot IDs. There is potentially other metadata in this data structure (timestamp, parent snapshot, etc.)

Contributor:

Ah my bad. I was confused by what snapshot meant. I guess it's a Mapping so this is array[Mapping]?

Comment on lines +185 to +186

```
child 1, row: uint16 not null
child 2, flags: uint16
```
Contributor:

Can we describe these two parameters, please? I don't understand what they mean.

@rabernat (author) replied:

These come directly from Seba's Icechunk V2 Notes

My understanding is that row is a pointer to the row in the attrs file where entries from this node start, while flags is a customizable extension point. Maybe @paraseba can elaborate more?

Contributor:

We want to know where to search for the metadata in a big attributes file, so we store the row in order to index into it in O(1). Flags is a way to know what to expect of a file before trying to parse it: every time we have a pointer, we include a flags field that could tell you something like "this is the parquet format version 0.3, with support for X". I'm not loving this because the field is repeated in every row, which takes memory, but we can be cautious with the encoding. I'll think more about it once we get to implement these more detailed features.
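
A minimal sketch of the O(1) lookup, with the file name and row value assumed for illustration:

```python
import pyarrow.parquet as pq

# Sketch only: `row` is the pointer stored in the node's structure-file entry.
attrs_table = pq.read_table("attrs-00001.parquet")  # assumed file name
row = 12_345                                        # taken from the structure file
node_attrs = attrs_table.slice(row, length=1)       # direct slice, no scan needed
```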

| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
| refs | NO | mapping[reference] | A mapping of references (string names) to snapshots |

A snapshot contains the following properties
Contributor:

Suggested change:
```diff
-A snapshot contains the following properties
+A snapshot is a Mapping that contains the following properties
```

@rabernat (author) replied:

What is a "Mapping"? Do you mean like a dict? This is JSON, so I think the correct entity type would be an Object.

@rabernat merged commit 8897053 into main on Aug 9, 2024 (1 check failed).
@rabernat deleted the ryan/spec-draft branch on August 9, 2024 at 21:36.