
Add spec document #1

Merged · 13 commits into main from ryan/spec-draft · Aug 9, 2024
Conversation

@rabernat (Contributor) commented Aug 4, 2024:

Here is my attempt to synthesize the original Figma proposal with @paraseba's design notes.

In writing this up, of course I had more ideas for how to improve things. So I am proposing a few more changes, which I summarize below. I know this may be disruptive now that implementation has already started.

Proposed Changes

  1. "Catalog" has been renamed to "State File"

  2. State file now contains two arrays:

    1. Snapshots - a list of all snapshots that have not been expired
    2. References - tags and branches

    This is equivalent to what is supported by the Iceberg REST catalog. (A sketch of a possible state file follows this list.)

  3. Structure file uses a much more nested schema than what @paraseba has proposed. This heavily leverages Arrow / Parquet nested structs and lists. Specifically, all of the array metadata lives within a big structure which is just null for groups. This felt like a much cleaner way to store both arrays and groups in the same table. I'm looking for detailed feedback on these schemas, particularly in light of the implementation work that has already been done.

  4. I'm wondering if we can get away without a separate attrs file for now. This adds a lot of complexity without much immediate benefit.
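
For concreteness, here is a sketch of what a state file under this proposal might look like (all field names and values are illustrative, not final):

```json
{
  "id": "9e7c2a4f",
  "generation": 7,
  "snapshots": [
    {
      "snapshot_id": "a1b2c3",
      "parent_snapshot_id": null,
      "timestamp": "2024-08-01T12:00:00Z"
    },
    {
      "snapshot_id": "d4e5f6",
      "parent_snapshot_id": "a1b2c3",
      "timestamp": "2024-08-04T09:44:00Z"
    }
  ],
  "refs": {
    "main": "d4e5f6",
    "v1.0": "a1b2c3"
  }
}
```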

Questions

  1. Should the state file explicitly store the file name of the structure file, or is this implicit in the snapshot ID? Similarly...
  2. Should the structure file explicitly store the file name of the manifest / inventory file, or is it enough to store an ID?
  3. In the structure file reference to the manifest, do we need to store a row range (i_row_start, i_row_stop) rather than just a single row (i_row)? Won’t this help us subset the data better?
  4. In the structure file, should we store the size of the manifest file? That will help us know whether we should just eagerly download the whole thing or try to use some smarter strategy.
  5. In the manifest file, what is chunk_id? Is this the chunk filename? Can we call it chunk_file instead? Why does a chunk need an ID?
  6. What is virtual_path? Is it mutually exclusive with chunk_id?
  7. Do we need to store the array ID in the chunk manifest in order to be able to find it? Otherwise the information about which array a chunk belongs to is not self-contained within the manifest file but instead requires the structure file's i_row to disambiguate between multiple chunks with the same coord.
  8. Can we decide on either "inventory" or "manifest" and stick to it? I slightly favor "manifest" because this is already in use within the community, but I don't care deeply.

TODO

  • decide how to handle Enum types (e.g. group / array). Pyarrow doesn't support enum types, which prevented me from using them in this schema. But there's no reason we couldn't use them from Rust.
  • decide on encoding (e.g. JSON, msgpack, CBOR) of arbitrary unstructured data within Parquet files. This is needed for user attrs and also codec configuration.
  • document key algorithms
  • improve "high level goals" section
  • convert JSON tables to proper JSON schema

@rabernat rabernat requested review from paraseba and dcherian August 4, 2024 09:44
The goals of the specification are as follows:

1. **Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a dataset. Writes across multiple arrays and chunks will be committed via a single atomic operation and will not be partially visible. Readers will not acquire locks.
2. **Chunk sharding and references** - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced.
Contributor:

A fast data move operation should be mentioned, as part of 2 or as a new 3.

  1. Time travel

@rabernat (author) replied:

Can you clarify what you mean by "data move"? Do you mean renaming groups and arrays?

The best analogy there with iceberg might be "schema evolution", i.e. adding, deleting, renaming columns. Do you think that's a useful concept to introduce?

Contributor:

Yes, sorry, I mean renaming of both groups and arrays without the need to copy the chunks. I like the analogy with schema evolution because both are done mostly for the same reasons. You start with a model, it evolves, you learn your hierarchy would be better in some other form.

@dcherian (Contributor) commented Aug 8, 2024:

IIUC Seba has consistently made the point that "read performance is prioritized over write performance." Is that right? If so, should it be mentioned here?

@rabernat (author) replied:

> read performance is prioritized over write performance

I'm not sure I understand how that constraint has impacted our design. AFAICT Icechunk write performance is no worse than vanilla Zarr, modulo the overhead of managing the manifests etc. (which also applies to reads).

Contributor:

We'll have a write performance impact, given by the need for transactionality. This is expected I think, a price paid to get the feature. This cost will be somewhat linear with the size of the write, but with a small constant. For example, we'll need to rewrite manifests to add changes.

```
path: string not null
  -- field metadata --
  description: 'path to the node within the store'
array_metadata: struct<
    shape: list<item: uint16> not null,
    data_type: string not null,
    fill_value: binary,
    dimension_names: list<item: string>,
    chunk_grid: struct<name: string not null,
                       configuration: struct<chunk_shape: list<item: uint16> not null> not null>,
    chunk_key_encoding: struct<name: string not null,
                               configuration: struct<separator: string not null> not null>,
    codecs: list<item: struct<name: string not null, configuration: binary>>
  >
```
Contributor:

For all array indexes we are currently using uint64. For the fill_value we'll use a Union array. For the chunk_grid, the current implementation is ignoring it and using just a normal uniform grid. Similarly for the chunk key encoding: we currently record only one of the two values possible in Zarr.

@rabernat (author) replied:

uint64 makes sense for array indexes. But not for chunk indexes; the chunk grid is orders of magnitude smaller than the actual array grid. Given that we have to store a lot of chunk indexes in this spec, I feel like we should try to use the smallest thing possible. But maybe that's premature optimization.

> For the chunk_grid, the current implementation is ignoring it and using just a normal uniform grid.

I agree that makes sense from an implementation point of view. From a spec point of view, I think it makes sense to store the actual official Zarr V3 metadata directly, rather than designing a new schema and having to convert between them. Are you ok with that?

Member:

We should call out that we are only intending to support Zarr v3 stores / metadata at this point. Else we need to make the structure file more flexible than is currently shown (or a separate v2 structure file).

@paraseba (Contributor) commented Aug 8, 2024:

> uint64 makes sense for array indexes. But not for chunk indexes

Kind of? Think height=1 pancakes, in which the array's time grid and chunk grid go in parallel.

Regarding how to store the zarr array metadata, we can discuss more about it. Initially I'm storing the columns explicitly, not as a blob; I thought the advantages were enough to justify the extra work. But it's close.

@rabernat (author) replied:

Seba, I think it makes sense to iterate on the details of these parts of the schema as you are implementing. We won't know the "right" choice until we actually code it up.

Contributor:

Absolutely @rabernat! That's the approach I'm taking: implementing the more basic encodings now, and iterating later once things work. This should be true for everything in this spec.

The latter requirement rules out structs or lists.

So we opt for a variable length binary encoding.
The chunk coord is created by encoding each element of the tuple as a big-endian `uint16` and then simply concatenating the bytes together in order. For the common case of arrays of <= 4 dimensions, this would use 8 bytes or less per chunk coord.
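
As a sketch of this encoding (illustrative only, not the official implementation):

```python
import struct

def encode_chunk_coord(coord: tuple[int, ...]) -> bytes:
    """Concatenate each element of the tuple as a big-endian uint16.

    Minimal sketch of the encoding described above; assumes every
    index fits in a uint16 (struct raises an error otherwise).
    """
    return b"".join(struct.pack(">H", i) for i in coord)

# e.g. (5, 30, 10) -> b"\x00\x05\x00\x1e\x00\x0a" (6 bytes for 3 dims)
assert encode_chunk_coord((5, 30, 10)) == b"\x00\x05\x00\x1e\x00\x0a"
```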
Contributor:

uint16 can only hold < 10 years of an hourly dataset (2^16 = 65,536 hourly steps ≈ 7.5 years, assuming one chunk per step).

@rabernat (author) replied:

Here's a different approach we could take to the coord index.

Rather than encoding the coord into a single field, we could define multiple optional int fields to use as a multi-index, i.e.

```
array_id: uint32 not null
index0: uint32 not null
index1: uint32
index2: uint32
index3: uint32
index4: uint32
index5: uint32
index6: uint32
index7: uint32
inline_data: binary
chunk_id: binary
virtual_path: string
offset: uint64
length: uint32 not null
```

This would put a constraint on how many dimensions the chunk grid could support.

Some tests using pyarrow suggest that this is about 4x slower in terms of indexing than the binary encoding.
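
For context, a rough sketch (not the actual benchmark code; data shapes and sizes assumed) of how such a pyarrow comparison might be set up:

```python
import time
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# Look up one chunk coord among a million rows, binary-encoded vs multi-column.
n = 1_000_000
rng = np.random.default_rng(0)
coords = rng.integers(0, 1000, size=(n, 3), dtype=np.uint16)

# Binary encoding: big-endian uint16s concatenated per row
binary = pa.table({"coord": pa.array([c.astype(">u2").tobytes() for c in coords])})

# Multi-index encoding: one uint32 column per dimension
multi = pa.table({f"index{i}": coords[:, i].astype(np.uint32) for i in range(3)})

target = coords[n // 2]

t0 = time.perf_counter()
binary.filter(pc.equal(binary["coord"], target.astype(">u2").tobytes()))
print("binary:", time.perf_counter() - t0)

t0 = time.perf_counter()
mask = pc.equal(multi["index0"], int(target[0]))
for i in (1, 2):
    mask = pc.and_(mask, pc.equal(multi[f"index{i}"], int(target[i])))
multi.filter(mask)
print("multi-index:", time.perf_counter() - t0)
```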

@paraseba (Contributor) commented Aug 5, 2024:

@rabernat, answers to some of your comments/questions:

  • for the nesting of the columns, yes. I'm experimenting with that because the library has rough edges. Some of the columns (like the ones for manifests) I'm putting nested into structures. It's going OK, but it's not any less work than fully expanding. I still think it's better to nest, because the resulting file is more compact, but there is no real advantage in code, storage, or anything. Just syntax sugar.
  • We are not going to implement attribute files in the first iteration, but I think they are important because they make for a better spec. Some use cases could call for very large user attributes, queried very infrequently, so configuring the writer to have very small inline metadata could be useful. I'd keep them in the spec for now, and see how it goes.
  • Should the catalog explicitly store the file name of the structure file

I think it's a useful level of indirection. It gives us more freedom for things like expiration.

  • Should the structure file explicitly store the file name of the manifest / inventory file, or is it enough to store an ID?

    ID is enough

  • In the structure file, should we store the size of the manifest file? That will help us know whether we should just eagerly download the whole thing or try to use some smarter strategy.

    Good idea, this is the kind of thing I was reserving the "extra" field for, but this one feels like an optimization that the open source client can make so it deserves its own column. We'll add it in a second pass to the implementation.

  • In the manifest file, what is chunk_id? Is this the chunk filename? Can we call it chunk_file instead? Why does a chunk need an ID?

    see my comment in your PR

  • What is virtual path? Is it mutually exclusive with chunk_id?

    for virtual chunks we need the full path; for materialized chunks a short binary ID is enough (see the sketch after this list)

  • Do we need to store the array ID in the chunk manifest

    Yes, it's what I call id in the design notes; array_id would be a better name

  • Can we decide on either "inventory" or "manifest" and stick to it?

    Yes, manifest it is. I only switched to inventory temporarily because I was using metadata instead of user attributes and had a conflict over the initial letter.
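
To make the chunk_id / virtual_path distinction concrete, here is a hypothetical pair of manifest rows (field names and values illustrative, not fixed by the spec):

```python
# Hypothetical manifest rows; all values are illustrative only.
materialized_chunk = {
    "array_id": 42,
    "coord": b"\x00\x05\x00\x1e",      # binary-encoded chunk coord (5, 30)
    "chunk_id": b"\xab\xcd\xef\x01",   # short binary ID of a native chunk object
    "virtual_path": None,              # unused for materialized chunks
    "offset": 0,
    "length": 4096,
}
virtual_chunk = {
    "array_id": 42,
    "coord": b"\x00\x06\x00\x1e",
    "chunk_id": None,                          # unused for virtual chunks
    "virtual_path": "s3://some-bucket/data.nc",  # full path to an external file
    "offset": 1_048_576,                       # byte range within that file
    "length": 4096,
}
```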

@paraseba (Contributor) commented Aug 5, 2024:

> In the structure file, should we store the size of the manifest file?

One minor issue is we don't have a dedicated place for that kind of thing. We would have to repeat this info in each row that links to the file. Which I suppose is OK if we are careful about using dictionary encoding or something like that.

@rabernat (author) commented Aug 5, 2024:

Thanks for all the feedback. Looking at my calendar, I expect to have time to respond and make a revision on Wednesday.

@rabernat (author) commented Aug 8, 2024:

> We would have to repeat this info in each row that links to the file. Which I suppose is OK if we are careful about using dictionary encoding or something like that.

I see what you mean. This is a consequence of using a denormalized schema. A more normalized schema would have a separate table for every single entity.

We could use a sqlite file if we wanted to go that route.
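
For reference, a minimal pyarrow sketch (illustrative values) of why the repetition stays cheap under dictionary encoding, as suggested above:

```python
import pyarrow as pa

# Illustrative only: a per-file value (e.g. manifest size) repeated in every
# row that links to that file. Dictionary encoding stores each distinct value
# once plus small integer indices, so the repetition costs little.
sizes = pa.array([204_800] * 1_000_000, type=pa.uint64())
encoded = sizes.dictionary_encode()
print(encoded.type)             # dictionary<values=uint64, indices=int32, ordered=0>
print(len(encoded.dictionary))  # 1 distinct value stored
```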

@jhamman (Member) left a review:

This was my first complete read of this document so I added comments throughout the entire thing. Most important right now is my question about the catalog and dataset discovery.

Comment on lines 6 to 7
This specification describes a single Icechunk **dataset**.
A dataset is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently.
Member:

Can we just replace dataset with store? If that is what we mean, then I don't understand why we must add a layer of indirection from the start.

@rabernat (author) replied:

Yes, in my mind store and dataset are equivalent. So you're saying you think we should just use store? Fine with me.

Member:

Yes, that is what I'm saying.

Contributor:

Yeah I'd find that less confusing. I would also quote the Zarr definition of "store" to be quite explicit here.


#### Chunk Coord Encoding

Chunk coords are tuples of positive ints (e.g. `(5, 30, 10)`).
Member:

A few nits here:

a) remember that 0 is valid here
b) there has been talk of adding negative chunk coords to allow left appends - zarr-developers/zarr-specs#9

```python
def delete_dataset(dataset_identifier) -> None:
    """Remove a dataset from the catalog."""
    ...
```
Member:

does the catalog support listing datasets? Or any other discovery mechanisms?

@rabernat (author) replied:

Right now I'm putting that in the category of "things that could be implemented by a catalog but aren't strictly required"
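
For illustration, such an optional discovery method might look like this (hypothetical; not part of the required catalog interface):

```python
def list_datasets() -> list[str]:
    """Return identifiers for all datasets known to this catalog.

    Hypothetical optional extension; not required by the spec.
    """
    ...
```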

| id | YES | str UID | A unique identifier for the dataset |
| generation | YES | int | An integer which must be incremented whenever the state file is updated |
| store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. |
| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
@dcherian (Contributor) commented Aug 8, 2024:

Suggested change:
```diff
-| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
+| snapshots | YES | array[Mapping] | A list of all of the snapshots (described below) |
```

@rabernat (author) replied:

I'm not sure we want just snapshot IDs. There is potentially other metadata in this data structure (timestamp, parent snapshot, etc.)

Contributor:

Ah my bad. I was confused by what snapshot meant. I guess it's a Mapping so this is array[Mapping]?

Comment on lines +185 to +186

```
child 1, row: uint16 not null
child 2, flags: uint16
```
Contributor:

Can we describe these two parameters, please? I don't understand what they mean.

@rabernat (author) replied:

These come directly from Seba's Icechunk V2 Notes

My understanding is that row is a pointer to the row in the attrs file where entries from this node start, while flags is a customizable extension point. Maybe @paraseba can elaborate more?

Contributor:

We want to know where to search for the metadata in a big attributes file, so we store the row in order to index into it in O(1). Flags is a way to know what to expect of a file before trying to parse it: every time we have a pointer, we include a flags field that could tell you something like "this is the parquet format version 0.3, with support for X". I'm not loving this because the field is repeated in every row, which takes memory, but we can be cautious with the encoding. I'll think more about it once we get to implement these more detailed features.
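
A minimal sketch of the O(1) lookup, with the file name and row value assumed for illustration:

```python
import pyarrow.parquet as pq

# Sketch only: `row` is the pointer stored in the node's structure-file entry.
attrs_table = pq.read_table("attrs-00001.parquet")  # assumed file name
row = 12_345                                        # taken from the structure file
node_attrs = attrs_table.slice(row, length=1)       # direct slice, no scan needed
```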

| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
| refs | NO | mapping[reference] | A mapping of references (string names) to snapshots |

A snapshot contains the following properties
Contributor:

Suggested change:
```diff
-A snapshot contains the following properties
+A snapshot is a Mapping that contains the following properties
```

@rabernat (author) replied:

What is a "Mapping"? Do you mean like a dict? This is JSON, so I think the correct entity type would be an Object.

@rabernat merged commit 8897053 into main on Aug 9, 2024 (1 check failed).
@rabernat deleted the ryan/spec-draft branch on August 9, 2024 at 21:36.