Add spec document #1

Merged: 13 commits, merged on Aug 9, 2024

spec/icechunk_spec.md (272 additions, 0 deletions)
# Icechunk Specification

The Icechunk specification is a storage specification for [Zarr](https://zarr-specs.readthedocs.io/en/latest/specs.html) data.
Icechunk is inspired by Apache Iceberg and borrows many concepts and ideas from the [Iceberg Spec](https://iceberg.apache.org/spec/#version-2-row-level-deletes).

This specification describes a single Icechunk **dataset**.
A dataset is defined as a Zarr store containing one or more interrelated Arrays and Groups which must be updated consistently.
Member:

Can we just replace dataset with store? If that is what we mean, then I don't understand why we must add a layer of indirection from the start.

Contributor Author:

Yes, in my mind store and dataset are equivalent. So you're saying you think we should just use store? Fine with me.

Member:

Yes, that is what I'm saying.

Contributor:

Yeah I'd find that less confusing. I would also quote the Zarr definition of "store" to be quite explicit here.

The most common scenario is for a dataset to contain a single Zarr group with multiple arrays, each corresponding to a different physical variable but sharing common spatiotemporal coordinates.

## Comparison with Iceberg

| Iceberg Entity | Icechunk Entity |
|--|--|
| Table | Dataset |
| Column | Array |
| Catalog | State File |
| Snapshot | Snapshot |

## Goals

The goals of the specification are as follows:

1. **Serializable isolation** - Reads will be isolated from concurrent writes and always use a committed snapshot of a dataset. Writes across multiple arrays and chunks will be committed via a single atomic operation and will not be partially visible. Readers will not acquire locks.
2. **Chunk sharding and references** - Chunk storage is decoupled from specific file names. Multiple chunks can be packed into a single object (sharding). Zarr-compatible chunks within other file formats (e.g. HDF5, NetCDF) can be referenced.
Contributor:

Fast data move operation should be mentioned, as part of 2 or as 3.

  1. Time travel

Contributor Author:

Can you clarify what you mean by "data move"? Do you mean renaming groups and arrays?

The best analogy there with iceberg might be "schema evolution", i.e. adding, deleting, renaming columns. Do you think that's a useful concept to introduce?

Contributor:

Yes, sorry, I mean renaming of both groups and arrays without the need to copy the chunks. I like the analogy with schema evolution because both are done mostly for the same reasons. You start with a model, it evolves, you learn your hierarchy would be better in some other form.

Contributor (@dcherian, Aug 8, 2024):

IIUC Seba has consistently made the point that "read performance is prioritized over write performance." Is that right? If so, should it be mentioned here?

Contributor Author:

> read performance is prioritized over write performance

I'm not sure I understand how that constraint has impacted our design. AFAICT Icechunk write performance is no worse than vanilla Zarr, modulo the overhead of managing the manifests etc. (which also applies to reads).

Contributor:

We'll have a write performance impact, given by the need for transactionality. This is expected, I think; it's a price paid to get the feature. The cost will be roughly linear in the size of the write, but with a small constant. For example, we'll need to rewrite manifests to add the changes.


[TODO: there must be more, but these seem like the big ones for now]

### Filesystem Operations

The required filesystem operations are identical to Iceberg's. Icechunk only requires that file systems support the following operations:

- **In-place write** - Files are not moved or altered once they are written.
- **Seekable reads** - Chunk file formats may require seek support (e.g. shards).
- **Deletes** - Datasets delete files that are no longer used (via a garbage-collection operation).
Contributor:

strong read-after-write consistency is also required.

Contributor:

list operation could be required for value-add operations like GC and compaction.

Contributor Author:

> strong read-after-write consistency is also required.

Why? Isn't that only for the catalog? I'd like to keep catalog requirements separate if possible.

Contributor:

No, even for the pre-commit operation. We need the ability to read a chunk we just wrote in the same writer. Without read-after-write this goes out of the window and things become way more complicated. Commit is even worse.


These requirements are compatible with object stores like S3.

Datasets do not require random-access writes. Once written, chunk and metadata files are immutable until they are deleted.

## Specification

### Overview

Like Iceberg, Icechunk uses a series of linked metadata files to describe the state of the dataset.

- The **state file** is the entry point to the dataset. It stores a record of snapshots, each of which is a pointer to a single structure file.
- The **structure file** records all of the different arrays and groups in the dataset, plus their metadata. Every new commit creates a new structure file. The structure file contains pointers to one or more chunk manifest files and [optionally] attribute files.
- **Chunk Manifests** store references to individual chunks.
- **Attributes files** provide a way to store additional user-defined attributes for arrays and groups outside of the structure file.
- **Chunk files** store the actual compressed chunk data.

When reading a dataset, the client first opens the state file and chooses a specific snapshot to open.
The client then reads the structure file to determine the structure and hierarchy of the dataset.
When fetching data from an array, the client first examines the chunk manifest file[s] for that array and finally fetches the chunks referenced therein.

When writing a new dataset snapshot, the client first writes a new set of chunks and chunk manifests, and then generates a new structure file. Finally, in an atomic swap operation, it replaces the state file with a new state file recording the presence of the new snapshot.


```mermaid
flowchart TD
subgraph catalog
state[State File]
end
subgraph metadata
subgraph structure
structure1[Structure File 1]
structure2[Structure File 2]
end
subgraph attributes
attrs[Attribute File]
end
subgraph manifests
manifestA[Chunk Manifest A]
manifestB[Chunk Manifest B]
end
end
subgraph data
chunk1[Chunk File 1]
chunk2[Chunk File 2]
chunk3[Chunk File 3]
chunk4[Chunk File 4]
end

state -- snapshot ID --> structure2
structure1 --> attrs
structure1 --> manifestA
structure2 --> attrs
structure2 -->manifestA
structure2 -->manifestB
manifestA --> chunk1
manifestA --> chunk2
manifestB --> chunk3
manifestB --> chunk4

```
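To make the write path concrete, below is a minimal sketch of a commit, assuming a hypothetical `storage` client with an atomic compare-and-swap primitive. The spec does not prescribe a client API; `put_if_generation_matches` and all file names are illustrative.

```python
import json
import time
import uuid

def commit(storage, root: str, structure_file: str, message: str) -> None:
    # Sketch only: chunks, manifests, and the structure file are assumed to
    # have already been written under `root` before this point.
    old_state = json.loads(storage.get(f"{root}/state.json"))
    parent = old_state["snapshots"][-1]["snapshot-id"] if old_state["snapshots"] else None
    snapshot = {
        "snapshot-id": uuid.uuid4().hex,
        "parent-snapshot-id": parent,
        "timestamp-ms": int(time.time() * 1000),
        "structure-file": structure_file,
        "properties": {"message": message},
    }
    new_state = {
        **old_state,
        "generation": old_state["generation"] + 1,
        "snapshots": old_state["snapshots"] + [snapshot],
    }
    # The swap must be atomic: if another writer bumped the generation first,
    # this call fails and the commit is retried or aborted.
    storage.put_if_generation_matches(
        f"{root}/state.json",
        json.dumps(new_state).encode(),
        expected_generation=old_state["generation"],
    )
```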

### State File

The **state file** records the current state of the dataset.
All transactions occur by updating or replacing the state file.
The state file contains, at minimum, a pointer to the latest structure file snapshot.
Contributor:

We'll add a few properties here, like a set of attributes that inform the format version, so the reader can know what/how to look for. But this is good enough for now.



The state file is a JSON file. It contains the following required and optional fields.

[TODO: convert to JSON schema]

| Name | Required | Type | Description |
|--|--|--|--|
| id | YES | str UID | A unique identifier for the dataset |
| generation | YES | int | An integer which must be incremented whenever the state file is updated |
| store_root | NO | str | A URI which points to the root location of the store in object storage. If blank, the store root is assumed to be in the same directory as the state file itself. |
| snapshots | YES | array[snapshot] | A list of all of the snapshots. |
Contributor (@dcherian, Aug 8, 2024):

Suggested change:
- | snapshots | YES | array[snapshot] | A list of all of the snapshots. |
+ | snapshots | YES | array[Mapping] | A list of all of the snapshots (described below) |

Contributor Author:

I'm not sure we want just snapshot IDs. There is potentially other metadata in this data structure (timestamp, parent snapshot, etc.)

Contributor:

Ah my bad. I was confused by what snapshot meant. I guess it's a Mapping so this is array[Mapping]?

| refs | NO | mapping[reference] | A mapping of references to snapshots |

A snapshot contains the following properties
Contributor:

Suggested change:
- A snapshot contains the following properties
+ A snapshot is a Mapping that contains the following properties

Contributor Author:

What is a "Mapping"? Do you mean like a dict? This is JSON, so I think the correct entity type would be an Object.


| Name | Required | Type | Description |
|--|--|--|--|
| snapshot-id | YES | str UID | Unique identifier for the snapshot |
| parent-snapshot-id | NO | str UID | Parent snapshot (null for no parent) |
| timestamp-ms | YES | int | When the snapshot was committed |
| structure-file | YES | str | Name of the structure file for this snapshot |
| properties | NO | object | arbitrary user-defined attributes to associate with this snapshot |

References are a mapping of string names to snapshots


| Name | Required | Type | Description |
|--|--|--|--|
| name | YES | str | Name of the reference|
| snapshot-id | YES | str UID | The snapshot the reference points to |
| type | YES | "tag" / "branch" | Whether the reference is a tag or a branch |
Contributor:

I wonder if we should leave this field open and let people use whatever they want, maybe reserving tag and branch as internal to the format.
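For illustration, a minimal state file with one committed snapshot and a single `main` branch reference might look like the sketch below (all identifiers, timestamps, and file names are invented):

```python
import json

state = {
    "id": "e6b7c9a2f41d4b3b9c1d2e3f4a5b6c7d",  # invented dataset UID
    "generation": 1,
    # store_root omitted: data files live alongside the state file
    "snapshots": [
        {
            "snapshot-id": "a1b2c3d4",
            "parent-snapshot-id": None,
            "timestamp-ms": 1723161600000,
            "structure-file": "structure-00001.parquet",
            "properties": {"message": "initial commit"},
        }
    ],
    "refs": {
        "main": {"name": "main", "snapshot-id": "a1b2c3d4", "type": "branch"}
    },
}

print(json.dumps(state, indent=2))
```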


### File Layout

The state file can be stored separately from the rest of the data or together with it. The rest of the data files in the dataset must be kept in a directory with the following structure.

- `$ROOT` base URI (s3, gcs, file, etc.)
- `$ROOT/state.json` (optional) state file
- `$ROOT/s/` for the structure files
- `$ROOT/a/` arrays and groups attribute information
- `$ROOT/i/` array chunk manifests (i for index or inventory)
- `$ROOT/c/` array chunks
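As a small illustration of this layout, a client would resolve object keys along these lines (the base URI and file names are invented):

```python
ROOT = "s3://example-bucket/my-dataset"  # invented base URI

state_key = f"{ROOT}/state.json"                    # only if co-located with the data
structure_key = f"{ROOT}/s/structure-00001.parquet"
attrs_key = f"{ROOT}/a/attrs-00001.parquet"
manifest_key = f"{ROOT}/i/manifest-00001.parquet"
chunk_key = f"{ROOT}/c/chunk-00042"                 # chunk file naming is not prescribed above
```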

### Structure Files

The structure file fully describes the schema of the dataset, including all arrays and groups.

The structure file is a Parquet file.
Each row of the file represents an individual node (array or group) of the Zarr dataset.

The structure file has the following Arrow schema:

```
id: uint16 not null
-- field metadata --
description: 'unique identifier for the node'
type: string not null
-- field metadata --
description: 'array or group'
path: string not null
-- field metadata --
description: 'path to the node within the store'
array_metadata: struct<shape: list<item: uint16> not null, data_type: string not null, fill_value: binary, dimension_names: list<item: string>, chunk_grid: struct<name: string not null, configuration: struct<chunk_shape: list<item: uint16> not null> not null>, chunk_key_encoding: struct<name: string not null, configuration: struct<separator: string not null> not null>, codecs: list<item: struct<name: string not null, configuration: binary>>>
Contributor:

For all array indexes we are currently using uint64. For the fill_value we'll use a Union array. For the chunk_grid, the current implementation is ignoring it and using just a normal uniform grid. Similar for the chunk key encoding, currently only recording one of the two values possible in zarr.

Contributor Author:

uint64 makes sense for array indexes. But not for chunk indexes; the chunk grid is orders of magnitude smaller than the actual array grid. Given that we have to store a lot of chunk indexes in this spec, I feel like we should try to use the smallest thing possible. But maybe that's premature optimization.

> For the chunk_grid, the current implementation is ignoring it and using just a normal uniform grid.

I agree that makes sense from an implementation point of view. From a spec point of view, I think it makes sense to store the actual official Zarr V3 metadata directly, rather than designing a new schema and having to convert between them. Are you ok with that?

Member:

We should call out that we are only intending to support Zarr v3 stores / metadata at this point. Else we need to make the structure file more flexible than is currently shown (or a separate v2 structure file).

Contributor (@paraseba, Aug 8, 2024):

> uint64 makes sense for array indexes. But not for chunk indexes

Kind of?

Think height=1 pancakes, in which the array's time grid and chunk grid go in parallel.

Regarding how to store the zarr array metadata ... we can discuss more about it. Initially, I'm storing the columns explicitly, not as a blob, I thought the advantages were enough to justify the extra work. But it's close.

Contributor Author:

Seba, I think it makes sense to iterate on the details of these parts of the schema as you are implementing. We won't know the "right" choice until we actually code it up.

Contributor:

Absolutely @rabernat ! That's the approach I'm taking, implementing the more basic encodings now, and iterate later once things work. This should be true for everything in this spec.

child 0, shape: list<item: uint16> not null
child 0, item: uint16
child 1, data_type: string not null
child 2, fill_value: binary
child 3, dimension_names: list<item: string>
child 0, item: string
child 4, chunk_grid: struct<name: string not null, configuration: struct<chunk_shape: list<item: uint16> not null> not null>
child 0, name: string not null
child 1, configuration: struct<chunk_shape: list<item: uint16> not null> not null
child 0, chunk_shape: list<item: uint16> not null
child 0, item: uint16
child 5, chunk_key_encoding: struct<name: string not null, configuration: struct<separator: string not null> not null>
child 0, name: string not null
child 1, configuration: struct<separator: string not null> not null
child 0, separator: string not null
child 6, codecs: list<item: struct<name: string not null, configuration: binary>>
child 0, item: struct<name: string not null, configuration: binary>
child 0, name: string not null
child 1, configuration: binary
-- field metadata --
description: 'All the Zarr array metadata'
inline_attrs: binary
-- field metadata --
description: 'user-defined attributes, stored inline with this entry'
attrs_reference: struct<attrs_file: string not null, row: uint16 not null, flags: uint16>
child 0, attrs_file: string not null
child 1, row: uint16 not null
child 2, flags: uint16
Contributor (on the `attrs_reference` fields above):

Can we describe these two parameters please. I don't understand what they mean.

Contributor Author:

These come directly from Seba's Icechunk V2 Notes

My understanding is that row is a pointer to the row in the attrs file where entries from this node start, while flags is a customizable extension point. Maybe @paraseba can elaborate more?

Contributor:

We want to know where to search for the metadata in a big attributes file, so we store the row to index to it in O(1). Flags is a way to know what to expect of a file before trying to parse it, so every time we have a pointer, we include a flags field that could tell you something like "this is the parquet format version 0.3, with support for X". I'm not loving this because the field is repeated in every row, which takes memory, but we can be cautious with the encoding. I'll think more about it once we get to implement these more detailed features.

-- field metadata --
description: 'user-defined attributes, stored in a separate attributes file'
inventories: list<item: struct<inventory_file: string not null, row: uint16 not null, extent: list<item: fixed_size_list<item: uint16>[2]> not null, flags: uint16>>
child 0, item: struct<inventory_file: string not null, row: uint16 not null, extent: list<item: fixed_size_list<item: uint16>[2]> not null, flags: uint16>
child 0, inventory_file: string not null
child 1, row: uint16 not null
child 2, extent: list<item: fixed_size_list<item: uint16>[2]> not null
child 0, item: fixed_size_list<item: uint16>[2]
child 0, item: uint16
child 3, flags: uint16
```
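As a concrete (if trivial) example of this schema, the sketch below writes a one-row structure file describing a single root group. It assumes the `structure_schema` object from `spec/schema.py` (included later in this PR) is importable; the output file name is illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

from schema import structure_schema  # spec/schema.py, assumed importable

root_group = pa.Table.from_arrays(
    [
        pa.array([0], type=pa.uint16()),         # id
        pa.array(["group"], type=pa.string()),   # type
        pa.array(["/"], type=pa.string()),       # path
        pa.array([None], type=structure_schema.field("array_metadata").type),
        pa.array([None], type=pa.binary()),      # inline_attrs
        pa.array([None], type=structure_schema.field("attrs_reference").type),
        pa.array([None], type=structure_schema.field("inventories").type),
    ],
    schema=structure_schema,
)
pq.write_table(root_group, "s/structure-00001.parquet")  # illustrative name
```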

### Attributes Files

[TODO: do we really need attributes files?]

### Chunk Manifest Files

A chunk manifest file stores chunk references.
Chunk references from multiple arrays can be stored in the same chunk manifest.
The chunks from a single array can also be spread across multiple manifests.

Chunk manifest files are Parquet files.
They have the following Arrow schema.

```
id: uint32 not null
array_id: uint32 not null
coord: binary not null
inline_data: binary
chunk_file: string
offset: uint64
length: uint32 not null
```

- **id** - unique ID for the chunk.
- **array_id** - ID of the array this chunk belongs to.
- **coord** - position of the chunk within the array. See _chunk coord encoding_ for more detail.
- **inline_data** - [optional] chunk data stored directly in the manifest instead of in a chunk file.
- **chunk_file** - the name of the file in which the chunk resides.
- **offset** - offset in bytes within the chunk file.
- **length** - size of the chunk in bytes.
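For illustration, the sketch below builds a two-row manifest for a 2-D array: one chunk stored in a shared chunk file via `chunk_file`/`offset`/`length`, and one tiny chunk stored inline. It assumes the `manifest_schema` object from `spec/schema.py` (included later in this PR); coords use the binary encoding described in the next section, and all names are invented.

```python
import pyarrow as pa
import pyarrow.parquet as pq

from schema import manifest_schema  # spec/schema.py, assumed importable

manifest = pa.Table.from_arrays(
    [
        pa.array([0, 1], type=pa.uint32()),                 # id
        pa.array([7, 7], type=pa.uint32()),                 # array_id
        pa.array([b"\x00\x00\x00\x00",                      # coord (0, 0)
                  b"\x00\x01\x00\x00"], type=pa.binary()),  # coord (1, 0)
        pa.array([None, b"\x00\x01\x02"], type=pa.binary()),  # inline_data
        pa.array(["chunk-00042", None], type=pa.string()),  # chunk_file
        pa.array([0, None], type=pa.uint64()),              # offset
        pa.array([65536, 3], type=pa.uint32()),             # length
    ],
    schema=manifest_schema,
)
pq.write_table(manifest, "i/manifest-00001.parquet")  # illustrative name
```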

#### Chunk Coord Encoding

Chunk coords are tuples of non-negative ints (e.g. `(5, 30, 10)`).
Member:

A few nits here:

a) remember that 0 is valid here
b) there has been talk of adding negative chunk coords to allow left appends - zarr-developers/zarr-specs#9

In normal Zarr, chunk keys are encoded as strings (e.g. `5.30.10`).
We want an encoding that is:
- efficient (minimal storage size)
- sortable
- usable as a predicate in Arrow

The first two requirements rule out string encoding.
The latter requirement rules out structs or lists.

So we opt for a variable length binary encoding.
The chunk coord is created by encoding each element of the tuple as a big-endian `uint16` and then simply concatenating the bytes together in order. For the common case of arrays <= 4 dimensions, this would use 8 bytes or less per chunk coord.
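A minimal sketch of this encoding (the helper name is illustrative):

```python
import struct

def encode_chunk_coord(coord: tuple[int, ...]) -> bytes:
    # Pack each element as a big-endian uint16 and concatenate the bytes.
    return b"".join(struct.pack(">H", c) for c in coord)

assert encode_chunk_coord((5, 30, 10)) == b"\x00\x05\x00\x1e\x00\x0a"
```

Because each element is fixed-width and big-endian, lexicographic byte order matches tuple order for coords of the same dimensionality.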
Contributor:

uint16 can only hold < 10 years of an hourly dataset.

Contributor Author:

Here's a different approach we could take to the coord index.

Rather than encoding the coord into a single field, we could define multiple optional int fields we could use as a multiindex, i.e.

array_id: uint32 not null
index0: uint32 not null
index1: uint32
index2: uint32
index3: uint32
index4: uint32
index5: uint32
index6: uint32
index7: uint32
inline_data: binary
chunk_id: binary
virtual_path: string
offset: uint64
length: uint32 not null

This would put a constraint on how many dimensions the chunk grid could support.

Some tests using pyarrow suggest that this is about 4x slower in terms of indexing than the binary encoding.


### Chunk Files

Chunk files contain the compressed binary chunks of a Zarr array.
Icechunk permits quite a bit of flexibility about how chunks are stored.
Chunk files can be:

- One chunk per chunk file (i.e. standard Zarr)
- Multiple contiguous chunks from the same array in a single chunk file (similar to Zarr V3 shards)
- Chunks from multiple different arrays in the same file
- Other file types (e.g. NetCDF, HDF5) which contain Zarr-compatible chunks

Applications may choose to arrange chunks within files in different ways to optimize I/O patterns.
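Whatever the packing, a reader only needs the `(chunk_file, offset, length)` triple from the manifest plus a seekable read. A local-filesystem sketch (for an object store this becomes a ranged GET):

```python
def read_chunk_bytes(chunk_path: str, offset: int, length: int) -> bytes:
    # Read a single chunk out of a (possibly shared) chunk file.
    with open(chunk_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```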

## Algorithms

### Initialize New Store

### Write Snapshot

### Read Snapshot

### Expire Snapshots
spec/schema.py (110 additions, 0 deletions)
import pyarrow as pa

structure_schema = pa.schema(
[
pa.field("id", pa.uint16(), nullable=False, metadata={"description": "unique identifier for the node"}),
pa.field("type", pa.string(), nullable=False, metadata={"description": "array or group"}),
pa.field("path", pa.string(), nullable=False, metadata={"description": "path to the node within the store"}),
pa.field(
"array_metadata",
pa.struct(
[
pa.field("shape", pa.list_(pa.uint16()), nullable=False),
pa.field("data_type", pa.string(), nullable=False),
pa.field("fill_value", pa.binary(), nullable=True),
pa.field("dimension_names", pa.list_(pa.string())),
pa.field(
"chunk_grid",
pa.struct(
[
pa.field("name", pa.string(), nullable=False),
pa.field(
"configuration",
pa.struct(
[
pa.field("chunk_shape", pa.list_(pa.uint16()), nullable=False),
]
),
nullable=False
)
]
)
),
pa.field(
"chunk_key_encoding",
pa.struct(
[
pa.field("name", pa.string(), nullable=False),
pa.field(
"configuration",
pa.struct(
[
pa.field("separator", pa.string(), nullable=False),
]
),
nullable=False
)
]
)
),
pa.field(
"codecs",
pa.list_(
pa.struct(
[
pa.field("name", pa.string(), nullable=False),
pa.field("configuration", pa.binary(), nullable=True)
]
)
)
)
]
),
nullable=True,
metadata={"description": "All the Zarr array metadata"}
),
pa.field("inline_attrs", pa.binary(), nullable=True, metadata={"description": "user-defined attributes, stored inline with this entry"}),
pa.field(
"attrs_reference",
pa.struct(
[
pa.field("attrs_file", pa.string(), nullable=False),
pa.field("row", pa.uint16(), nullable=False),
pa.field("flags", pa.uint16(), nullable=True)
]
),
nullable=True,
metadata={"description": "user-defined attributes, stored in a separate attributes file"}
),
pa.field(
"inventories",
pa.list_(
pa.struct(
[
pa.field("inventory_file", pa.string(), nullable=False),
pa.field("row", pa.uint16(), nullable=False),
pa.field("extent", pa.list_(pa.list_(pa.uint16(), 2)), nullable=False),
pa.field("flags", pa.uint16(), nullable=True)
]
)
),
nullable=True
),
]
)

print(structure_schema)

manifest_schema = pa.schema(
[
pa.field("id", pa.uint32(), nullable=False),
pa.field("array_id", pa.uint32(), nullable=False),
pa.field("coord", pa.binary(), nullable=False),
pa.field("inline_data", pa.binary(), nullable=True),
pa.field("chunk_file", pa.string(), nullable=True),
pa.field("offset", pa.uint64(), nullable=True),
pa.field("length", pa.uint32(), nullable=False)
]
)

print(manifest_schema)