Virtual chunks design document #436

Open · wants to merge 5 commits into main

Conversation

@paraseba (Contributor) commented Dec 4, 2024:

These are early thoughts, feel free to provide feedback and different ideas. Also, particularly important: am I missing requirements?

Closes: #430

@paraseba (Author) commented Dec 4, 2024:

@rabernat said:

I'm very 👎 on creating a new mechanism for storing "config" outside of the existing snapshot system, particularly if it involves transactional updates. I would much rather create a new type of metadata file that is referenced from the snapshot. I'd like the entire state of the store to be self contained within a snapshot.

That's very similar to what I'm proposing here. I'm also proposing a new type of metadata file. But:

  • The config needs to be global, not per snapshot.
  • I don't want to have to serialize the requests for the snapshot and the config, so the config_id cannot live in the snapshot.
  • I could duplicate the file on every snapshot, renaming it to have the same id as the new snapshot, but that seems wasteful. Most of the time the config won't change.

So my current thinking is: a single configuration.msgpack file at the root of the repo. It's updated using a conditional PUT (If-Match). On reads, it gets loaded concurrently with the snapshot file.
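As a sketch of how that conditional update could work, using plain HTTP `If-Match` semantics against an already-authorized (e.g. pre-signed) URL; the helper and its names are illustrative, not Icechunk's actual API:

```python
import requests

def update_config(url: str, new_config: bytes, expected_etag: str) -> str:
    """Overwrite configuration.msgpack only if it still matches the etag we read."""
    resp = requests.put(url, data=new_config, headers={"If-Match": expected_etag})
    if resp.status_code == 412:  # Precondition Failed: someone updated it first
        raise RuntimeError("config changed concurrently; re-read and retry")
    resp.raise_for_status()
    # etag of the version we just wrote, to be used for the next conditional update
    return resp.headers["ETag"]
```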

@paraseba (Author) commented Dec 4, 2024:

@rabernat said:

What if we created a manifest of every single virtual file that is referenced in the repo? How big could this be? it's hard to imagine more than, say 1_000_000 virtual files?

My counterexample in the design document is a virtual zarr dataset. Every chunk is a different file, and all those files are on the same platform, bucket, etc. But even without being that extreme, this still has the same problem, no? All those prefixes in the files manifest are wasteful.

Also, we still need some kind of report on "dependencies", which is where the concept of "containers" comes in. That concept is user level, in the sense that users will (potentially) want to provide different credential sets: not on a per-file basis, but for whatever they consider a container. Users will want to know "what other datasets does this dataset depend on?"; containers are somewhat of a proxy for that.


### Prefixes

Storing the full URL for each chunk duplicates a lot of data in the manifest, making it large. The idea is to try to extract common _prefixes_ and identify them with a short id. Then the manifest only needs to store the prefix id, plus maybe an extra string, a length, and an offset.
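To make the idea concrete, here is a tiny sketch of what a prefix table plus a manifest entry could look like in memory (names and fields are illustrative, not the actual Icechunk structures):

```python
from dataclasses import dataclass

# prefix id -> shared prefix string, stored once instead of per chunk
PREFIXES = {0: "s3://some-bucket/some-prefix/"}

@dataclass
class VirtualChunkRef:
    prefix_id: int  # index into the prefix table
    suffix: str     # part of the URL unique to this chunk
    offset: int     # byte offset of the chunk within the object
    length: int     # byte length of the chunk

    def url(self) -> str:
        return PREFIXES[self.prefix_id] + self.suffix

ref = VirtualChunkRef(prefix_id=0, suffix="c/0/0/1", offset=0, length=1_048_576)
assert ref.url() == "s3://some-bucket/some-prefix/c/0/0/1"
```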


Does this ever feel familiar to me...

@maxrjones left a comment:

Super excited about this work, thanks for sharing the link in the icechunk slack!


## What we need

1. Virtual chunks can be stored in all supported object stores


I know Rich Signell has used Kerchunk on local filesystems. Do you think virtual chunk storage needs to be limited to object stores?

Contributor:

No it doesn't, eventually we can support filesystems

Contributor:

Isn't it already supported?

paraseba (Author):

Yeah, we have filesystem support, meant mostly for tests. We won't drop it.

2. Multiple credentials sets for virtual chunk resolution
3. Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk
4. Be able to know all the "places" pointed to by a repo, without having to load every manifest in every version
5. Object level change detection (not serving a virtual chunk if the containing file changed)


I wonder if a more specific requirement here would be helpful. What does "not serve" mean? Is it an error, warning, or configurable behavior?

Contributor:

Runtime error when the object has changed, instead of getting invalid data.

Contributor:

Should the Runtime Error be thrown at data-access time or store/session-opening time?

Contributor:

It's not practical to do at store-opening time. That would mean checking the etag / modification time of every virtual file eagerly.

paraseba (Author):

I changed the description to make it more clear.

}
```

* The `last_modified` field contains an optional unix epoch in seconds. If set for a chunk, retrieval will fail if the file was modified after this time. We use this instead of an ETag or checksum to save space on the manifests. `u32` should be enough until year ~ 2100.


Appreciating that using an optional unix epoch saves on space, what do you expect the impact of choosing unix epoch over ETag or checksum to be on read or write time?

paraseba (Author):

I don't expect any performance impact on reads. The object store will retrieve the object only if the "didn't change" conditions are met, without any observable latency difference. On writes, it's a different story: we need to discuss with @TomNicholas whether the new proposal is something we can implement in VirtualiZarr and what the impact could be.

In any case, I'm making the checksum optional at the chunk level.
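For the read path, the check can ride along with the normal GET as a conditional request, so there is no extra round trip. A hedged sketch with boto3 (the bucket, key, and error handling here are illustrative, not Icechunk's implementation):

```python
from datetime import datetime, timezone

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def fetch_virtual_chunk(bucket: str, key: str, offset: int, length: int,
                        last_modified: int) -> bytes:
    """Fetch a byte range, failing if the object changed after `last_modified`."""
    try:
        resp = s3.get_object(
            Bucket=bucket,
            Key=key,
            Range=f"bytes={offset}-{offset + length - 1}",
            IfUnmodifiedSince=datetime.fromtimestamp(last_modified, tz=timezone.utc),
        )
    except ClientError as err:
        if err.response["ResponseMetadata"]["HTTPStatusCode"] == 412:
            raise RuntimeError("source object of the virtual chunk has changed") from err
        raise
    return resp["Body"].read()
```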

# Improvements to virtual chunks handling

## What we need


I understand that this design document pertains mostly to the VirtualChunkContainer, so I'm not sure if these are in scope but there are some interactions at the store level that may be worth mentioning. For example,


6. > Object level change detection (not serving a virtual chunk if the containing file changed)

We optionally store the last-modified timestamp of the object containing each chunk. This is not optimal: a hash would be better, but it takes more space. It's also not optimal to store it per chunk, when many chunks come from the same object, but we don't have another easy place to store it if containers include more than one object. We could add more configuration to optimize this in the future.


> we don't have other easy place to store it if they are using containers that include more than one object

Both the redundant prefix and redundant timestamp issue seem similar in some ways to the redundant STAC information solved by hydration/dehydration in pgstac. I'm curious if there's anything to be learned from the benefits/downsides of that approach.

paraseba (Author):

I dropped the requirement to do this optimization at this time. I think we can make it comfortably to 1.0 without it. In any case, this was an in-memory space optimization that we can add later.

* Example: `s3://some-bucket/some-prefix/c/0/0/1` is pointed to by only 1 chunk. But there are many objects like it.
* There could be tens of millions of chunks (so different urls)

### Prefixes


On this whole topic I would comment:

  • compression algorithms are pretty good at this; if you have a bunch of long strings with some prefix agreement between them, it will not result in extra bytes in storage
  • parquet supports prefix-delta encoding ( https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-strings-delta_byte_array--7 )
  • if you really want to go down the road of constructing strings, only make them on demand, else you fill up RAM anyway. I found that templating is surprisingly slow (with python strings)

paraseba (Author):

We are currently not optimizing the on-disk format. This was meant as an in-memory optimization; Icechunk keeps full manifests in memory. But in my latest commit, I'm dropping this requirement for now, and we'll evaluate better ways to optimize the in-memory data structure independently of the on-disk format.


The advantage of per-snapshot prefixes is that we don't need to load/parse all the prefixes if the current snapshot uses only a few. A case for this would be a virtual dataset that has been materialized in a newer version: if prefixes are per-snapshot, we wouldn't need to load any prefixes.

On the other hand, requirement 4 asks for the full list of "dependencies" for the dataset, and that means for all versions, which requires a global list of prefixes. Without a global list, we would need to fetch the list of prefixes of each snapshot.


Parquet implements min/max statistics on any column for each partition, so you know from the file metadata alone what they contain, without loading all the data.


### Supporting multiple clouds

When the user adds a virtual ref, they need to indicate the details of the platform to retrieve it. These include what cloud it is (so we know how to set up credentials), region, endpoint url, port and other settings. It looks a lot like our `S3Config` type (or an expanded version of that).


Credentials and config will not be the same for any given user. I don't think you can reasonably encode them for general use. As far as S3 goes, the only fields I would think would be up for consideration would be: anonymous, endpoint and region (the latter debatable). Of these, only anonymous is relevant for gcs/azure.

paraseba (Author):

Absolutely, I think things are clearer in my latest commit. Credentials are an "instance construction" concern, not a write-time concern. Different users will instantiate the repository with different credential sets. We just need a way to tell them what they need to provide, and a way for them to provide it.

platform: Platform, // something like S3
url_template: String, // something like some-bucket/some-prefix/output/{}/{}.nc
default_template_arguments: Vec<String>,
cloud_client_config: Map<String, Any>, // something like {endpoint_url: 8080}


The endpoint being different for some S3 than other S3 is totally a use case.


Consider a kerchunked original dataset on a public AWS S3, and additional updates on my own minio.

paraseba (Author):

Yes, you can have different containers, with different endpoints.

}
```

* The `last_modified` field contains an optional unix epoch in seconds. If set for a chunk, retrieval will fail if the file was modified after this time. We use this instead of an ETag or checksum to save space on the manifests. `u32` should be enough until year ~ 2100.


But actually you are after a (non-cryptographic) hash. Relying on times seems dangerous, even if we assume everything is UTC.

paraseba (Author):

It's debatable, I think. A few thoughts:

  • We did hashes in Arraylake and they didn't get a lot of use; they are expensive to compute, for example.
  • Timezones are not a concern here: object stores have a well-defined and standardized format for their modified-time headers.
  • Doing last_modified is very cheap and easy at write time; for example, if you pass now() it will work most of the time. An ETag you generally need to retrieve from the object store for each object. Imagine virtualizing a zarr store: you would need to fetch the etag for every chunk.
  • Last modified takes less space.

In any case, this is very useful feedback. In my last commit I made it possible to provide timestamp, etag, or nothing.

Now the user initializes a Repo instance by passing, if needed, a credential set for each virtual chunk container. If a chunk is read for which there are no credentials, we default to a common set of credentials for virtual chunks, and ultimately, to the Storage credentials if on the same platform.
3. > Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk

For large prefixes, the template + arguments approach creates some savings. For example, a template for virtual zarr like `some-long-bucket-name/with-some-even-longer-prefix/and-more-nesting/some-long-array-name-for-the-dataset/c/{}/{}/{}`.
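As a rough illustration of where the savings would come from (the template syntax and sizes are only an approximation of the proposal):

```python
template = ("some-long-bucket-name/with-some-even-longer-prefix/"
            "and-more-nesting/some-long-array-name-for-the-dataset/c/{}/{}/{}")

def chunk_url(i: int, j: int, k: int) -> str:
    # expand the shared template with the per-chunk arguments on demand
    return template.format(i, j, k)

print(chunk_url(0, 0, 1))
# Storing three small integers per chunk instead of a ~120-byte URL is where
# the manifest-size savings would come from.
```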


Again, I need a lot of convincing on this in the presence of decent compression, say Zstd.

strings = "".join([f"some-long-bucket-name/with-some-even-longer-prefix/and-more-nesting/some-long-array-name-for-the-dataset/c/{i}/{i}/{i}" for i in range(10000)])
out = cramjam.zstd.compress(strings.encode())
len(strings) # 1206670
len(out)   # 19052, 1.9 bytes per entry

paraseba (Author):

Remember, this was an optimization for the in-memory data structure. Coupling it with the on-disk format was a mistake (a consequence of Icechunk still not having a separation layer there). But even so, this is not the right way to optimize this. I dropped the requirement; we can absolutely do this optimization later. Thanks for the feedback on this, it changed my mind.


6. > Object level change detection (not serving a virtual chunk if the containing file changed)

We optionally store the last-modified timestamp of the object containing each chunk. This is not optimal: a hash would be better, but it takes more space. It's also not optimal to store it per chunk, when many chunks come from the same object, but we don't have another easy place to store it if containers include more than one object. We could add more configuration to optimize this in the future.


> but it takes more space

Take as many bytes as you feel necessary. It doesn't need to be a hash of the data, but a unique value from the store.

And these are stored per file, whereas there may be one or many chunks per file.

@rabernat (Contributor) left a comment:

Seba this is a really thoughtful and useful document for exploring this very tricky topic. Thanks so much for taking the time to write it out.

I think the proposal you have laid out will definitely work. I also think there is some room to clarify the requirements (what are we really optimizing for?) and possibly scope them down a bit.

The crux for me is defining the actual data model we will use, and your structs are a very useful way to do that. The key question appears to be, what exactly is "place" (c.f. req. 4)? Right now, I am concerned that VirtualChunkContainer couples together two separate requirements (4 and 3 -"optimized manifests") unnecessarily.

I personally like my idea of storing the config right in the branch / tag file. 😁

I also still think it is worth considering storing each virtual file in a manifest of its own, and then referencing this from the chunk manifest.

## What we need

1. Virtual chunks can be stored in all supported object stores
2. Multiple credentials sets for virtual chunk resolution

Contributor:

Can we sharpen this requirement a bit? It is currently an incomplete sentence and is thus less prescriptive than 1. Maybe there are actually two separate requirements here?

Suggested change
2. Multiple credentials sets for virtual chunk resolution
2. Virtual chunks within a single repo can be stored in multiple different object storage buckets in different platforms
2.5. Users can specify different credentials for different object storage locations

Having written this down, it seems like 2 is a really broad and tricky requirement to satisfy. We could consider narrowing it to the following.

2. Virtual chunks within a repo can be stored in _one single object storage bucket in a specific cloud platform_. Mixing virtual references from multiple buckets in a single repo is not allowed.

I'm wondering if that would be too restrictive. It would certainly simplify the credentials story a lot.

@TomNicholas - can you think of scenarios where you would want to mix multiple different buckets in a single repo?


I think it would be useful to create virtual stores that would allow people to open a data tree containing both raw data and overviews. The raw data would commonly be in a separate bucket from the overviews.

Contributor:

Can you elaborate? Why? Could the overview be native writes to the repo?


> Can you elaborate? Why? Could the overview be native writes to the repo?

To be fair, I'm not sure how common this is but the primary case in which the overviews cannot be native writes to the repo is when they already exist as Zarr V2 data.

A more common scenario would be continuously adding NetCDF references as virtual chunks along with continuously adding Zarr overviews. My understanding from previous conversations was that once native data is committed to a repository, all new commits will need to be native rather than virtual. Is that incorrect? Could one continuously add virtual chunks for the full resolution data and native chunks for the overviews?

Contributor:

> Could one continuously add virtual chunks for the full resolution data and native chunks for the overviews?

Icechunk does not place restrictions on which chunks are native and which are virtual

Contributor:

> Icechunk does not place restrictions on which chunks are native and which are virtual

Yes you should already be able to create "mixed" icechunk stores using xarray and virtualizarr. Aimee's append_dim PR goes even further because it is one way to create a mixed array, where you could append native chunks to an array containing virtual chunks (see zarr-developers/VirtualiZarr#272 (comment)).


> @TomNicholas - can you think of scenarios where you would want to mix multiple different buckets in a single repo?

If you wanted to use virtual references to create some kind of aggregated dataset, e.g. put a bunch of disparate model run datasets that together comprise CMIP6 into one virtual store.

Those model runs could be in different buckets (or even clouds...) from different data providers. This kind of scenario is a little more likely to occur across groups in a store rather than within a single group/array IMO - think a DataTree where each group came from a different source. I think it's debatable whether that use case should be targeted by some kind of higher-level catalog instead, though.


> a DataTree where each group came from a different source. I think its debatable whether that use case should be targeted by some kind of higher-level catalog instead though.

I would say that absolutely this looks like a single dataset object

Contributor:

It's a good question: how deep of a hierarchy should we expect to store in a single Icechunk repo? Could we put all of CMIP6 in a single repo?

This is a problem we have also wrestled with in Arraylake.

If nothing in Icechunk scales as O(N), where N is the number of arrays / groups / chunks / references / etc., then it is possible.

However, the current design doesn't scale well to this case, because all of the metadata for the whole store goes in one single snapshot file.

Contributor:

Yes, there is an extremely thin line between "catalog" and "manifest" with arbitrary trees.

paraseba (Author):

I think things look much better in the latest commit. Please take a look and tell me what you think. The solution seems much simpler, and still allows all kinds of mixes of object stores, prefixes, credentials, etc.


1. Virtual chunks can be stored in all supported object stores
2. Multiple credentials sets for virtual chunk resolution
3. Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk

Contributor:

Suggested change
3. Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk
3. Efficient storage of chunk references, which often contain highly redundant prefixes

Also very important to clarify what is meant by "efficient" or "optimized" -- specifically whether we are trying to optimize for on-disk storage vs. in memory representation vs. lookup time or something else.

paraseba (Author):

Good feedback; I was too close to the problem and failed to add this context. This requirement has been dropped for now; it's an in-memory optimization that we can do at a later time, if needed.


In the first url style above, the prefix could be the full file, `s3://some-bucket/some-prefix/some-file.nc`. In the second example, there are options for the prefix, it could be `s3://some-bucket/some-prefix/c/0/0/1`, or `s3://some-bucket/some-prefix/c/0/0` or `s3://some-bucket/some-prefix/c/0` or `s3://some-bucket/some-prefix/c`, with different numbers of chunks under it in each case.

Intelligence is needed to optimize what is considered a good prefix. If done badly, we could have:

Contributor:

Is intelligence really needed? Or would an off-the shelf compression algorithm handle it just fine?

I decided to test this with my favorite little storage library

import zarr
import numpy as np

prefix = 's3://some-bucket/some-prefix/c/0/'
paths = np.array([str(i) for i in range(10000)])
full_paths = np.array([prefix + p for p in paths])

store = zarr.MemoryStore()
group = zarr.group(store=store)

path_array = group.create("paths", shape=len(paths), chunks=len(paths), dtype=str, compressor=zarr.Zstd())
path_array[:] = paths

full_path_array = group.create("full_paths", shape=len(paths), chunks=len(paths), dtype=str, compressor=zarr.Zstd())
full_path_array[:] = full_paths

print(path_array.nbytes_stored)
print(full_path_array.nbytes_stored)
# 6579
# 7048

So storing the full paths and then using a zstd compressor on the raw bytes creates only 7% more data than just storing the last part of the path.

This is what compression algorithms do: find redundant patterns in data. Should we really spend time trying to implement our own compression scheme for prefixes?

paraseba (Author):

This is my failure to clarify that we were looking for an in-memory optimization. Currently Icechunk lacks a layer of separation between the in-memory and on-disk data structures (that will have to come soon), which made things even more confusing.


The advantage of per-snapshot prefixes is that we don't need to load/parse all the prefixes if the current snapshot uses only a few. A case for this would be a virtual dataset that has been materialized in a newer version: if prefixes are per-snapshot, we wouldn't need to load any prefixes.

On the other hand, requirement 4 asks for the full list of "dependencies" for the dataset, and that means for all versions, which requires a global list of prefixes. Without a global list, we would need to fetch the list of prefixes of each snapshot.

Contributor:

I think we should sharpen the requirement about what exactly these dependencies are. Does the prefix have to be part of it? That feels like a bit of a separate concern. The point of the dependencies is to know what credentials will be required and where on the internet the virtual chunks might live. For these purposes, I'd suggest it's enough to know:

  • The platform (e.g. S3)
  • The bucket name
  • [Optional] The region
  • [Optional] The endpoint URL

paraseba (Author):

I think this is much more clear in my latest commit.


### Who creates/decides the prefixes

Obviously, generating the list of prefixes from the list of virtual chunks can be done automatically, but it's a tricky optimization problem. One strategy could be to generate a trie with all chunk URLs, and then use some heuristic to select prefixes from heavily populated nodes. It's tricky and potentially computationally expensive for large repos.
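A rough, hypothetical illustration of that kind of heuristic (count how many chunks fall under each "directory" prefix and keep the popular ones); this is only meant to show the shape of the problem, not a proposed algorithm:

```python
from collections import Counter

def suggest_prefixes(urls: list[str], min_chunks: int = 100) -> list[str]:
    """Return candidate prefixes covering at least `min_chunks` chunk URLs,
    most specific (longest) first."""
    counts: Counter[str] = Counter()
    for url in urls:
        parts = url.rstrip("/").split("/")
        # candidates range from scheme://bucket down to the object's parent "directory"
        for depth in range(3, len(parts)):
            counts["/".join(parts[:depth])] += 1
    popular = [p for p, n in counts.items() if n >= min_chunks]
    return sorted(popular, key=len, reverse=True)

urls = [f"s3://some-bucket/some-prefix/c/{i}/{j}/{k}"
        for i in range(2) for j in range(10) for k in range(10)]
print(suggest_prefixes(urls)[:3])
```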

Contributor:

This is where simply defining prefix as {protocol}://{bucket} might save us a lot of trouble, and then we let a standard compression scheme handle the rest.

}
```

* Users can add to the list and edit members but not delete.

Contributor:

But does garbage collection potentially do this?

I don't think we can meaningfully distinguish between "users" and "admins" of a repo. Currently the only way to update a repo is to make a commit. If we start making out-of-band ways to change a repo, don't we open ourselves up to consistency problems?


* The list of chunk containers is stored in a new global `Config` object. There is a single configuration per repo, and all reads and writes are done reading the latest configuration. In the initial implementation, the "history" of configuration is present only for reference and user inspection by hand.

TBD: how to store this information. One approach would be using the new PUT if-match S3 functionality, but this would be the first place where we overwrite objects.

Contributor:

Can you update Config and commit to a branch at the same time? What if a branch commit succeeds but Config update fails? Or vice versa? S3 makes no promises about transactions across multiple keys...that's what we have icechunk for.

This whole concept feels like a very slippery slope of nasty edge cases.

However, I see it may be necessary if we want to expose information that transcends an individual snapshot.

Here's an alternative approach: we could store the Config directly inside the branch or tag file as JSON. Obviously this does not scale to large numbers of VirtualChunkContainers. But it would have these benefits:

  • Does not increase the number of GET requests to open a store
  • Guaranteed consistency between commit and Config

The downside is that, to gather all of the dependencies for a repo, we have to read all of the tag and branch files (note: we never have to read the branch history, only ever the latest branch file). This number will probably be small for most repos.
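To make the alternative concrete, a hypothetical branch file with the config embedded alongside the snapshot pointer might look roughly like this (all field names and values invented for illustration, not Icechunk's actual ref format):

```python
import json

branch_file = {
    "snapshot_id": "SOME-SNAPSHOT-ID",
    "config": {
        "inline_chunk_threshold_bytes": 512,
        "virtual_chunk_containers": [
            {"name": "s3", "prefix": "s3://some-bucket/some-prefix/"},
        ],
    },
}
print(json.dumps(branch_file, indent=2))
```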

Comment on lines 93 to 96
struct Config {
inline_chunk_threshold_bytes: u16,
virtual_chunk_containers: Vec<VirtualChunkContainer>,
}

Contributor:

In general, I really like the idea of repo-level config stored with the repo. Iceberg has this and it seems really useful. Could eventually put all kinds of stuff here.

}
```

* The `last_modified` field contains an optional unix epoch in seconds. If set for a chunk, retrieval will fail if the file was modified after this time. We use this instead of an ETag or checksum to save space on the manifests. `u32` should be enough until year ~ 2100.

Contributor:

This feels like an unnecessary space optimization. ETag is clearly meant for this purpose. ETag has several advantages over timestamp. In particular, if the file is overwritten with the same data, the ETag doesn't change. Also, no timezone headaches to worry about.

At least we could make this a configuration choice and leave it up to the user.


> overwritten with the same data, the ETag doesn't change

This is only true (on S3) with a single PUT. For multi-part uploads, the chunk sizes need to match too, and maybe something else as well.

paraseba (Author):

In the latest commit, I'm allowing last_modified and etag (or nothing). What I REALLY like about last_modified is that you can "invent" it; you don't really need to fetch the etag for every object from the store. If you just pass now() as last_modified, it will work in most cases.
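A small sketch of how that per-chunk validity information could be modeled, with both fields optional (names are illustrative, not the final manifest schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VirtualChunkLocation:
    url: str
    offset: int
    length: int
    # Per the comment above: a timestamp, an etag, or neither may be provided.
    last_modified: Optional[int] = None  # unix epoch seconds; cheap, can be "now"
    etag: Optional[str] = None           # exact object version, if it was fetched

ref = VirtualChunkLocation("s3://some-bucket/file.nc", offset=0, length=4096,
                           last_modified=1733300000)
```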


### Reading virtual chunks

* When a user creates a repository instance, they can specify credentials for each `VirtualChunkContainer`, pointing to them by name.

Contributor:

I see, name is used to associate credentials with containers.

Do we have a feeling for how many VirtualChunkContainers we expect a typical repo to have?

Mixing the prefix into the VirtualChunkContainer will complicate this quite a bit. Credentials will generally be scoped to buckets, not prefixes. So if there are many VirtualChunkContainers with the same bucket, they would have to redundantly specify credentials for each one.

This problem goes away if you basically make each VirtualChunkContainer 1:1 with a set of credentials.


> Credentials will generally be scoped to buckets, not prefixes.

All backends support intra-bucket permissioning. In addition, you might have the same bucket name at different endpoints.

paraseba (Author):

Right, allowing more than one set of credentials for different paths of a bucket was part of the design (unfortunately as an unlisted requirement now that I think about it). It's also possible in my new commit.

* Dropped manifest optimization goal
* Dropped the performance goal for fetching all required virtual
  containers
* Much simplified and optional mechanism for defining virtual containers
* Simplified interaction with VirtualiZarr
* Added optional Etag
* More specifics about `Config`
* More details on the algorithm for virtual chunk location resolution

* s3://
* gcs://
* tigris:// (is this the right protocol?)

Contributor:

No, it's just S3 with a custom endpoint URL.

paraseba (Author):

There are also credentials... We may need to know it's tigris so we manage the generation of credentials differently from, say, localhost minio. Of course, we can always play games with the endpoint and figure it out, but this seems more future-proof?


> we manage the generation of credentials differently from say localhost minio

I recommend you keep to using ("s3") for everything that indeed matches that protocol, and the rest is in the configuration for the individual "containers" - isn't that the whole point? I don't think you want to maintain a potentially endless list of protocols for each and every service that might have some icy data.

> generation of credentials

What does this mean?

By the way, how exactly are chunk URLs mapped to "containers", i.e., how do you know which container a particular URL belongs to?

paraseba (Author):

One downside of using s3:// everywhere is that it forces users to actually create the VirtualChunkContainers, even for some very common object stores like Google's. Using the protocol would be a bit easier on them for the very common case of a simple repo on Google or tigris. But I take your point: which ones are s3:// and which ones are a protocol is arbitrary. Will think more about it.

paraseba (Author):

> how exactly are chunk URLs mapped to "containers",

@martindurant a rough description of the algorithm is documented here


> One downside of using s3:// everywhere is that it forces users to actually create the VirtualChunkContainers, even for some very common object stores like Google's.

?? Google's storage does not use the S3 protocol. When I say "protocol", I mean exactly that: a set of rules and conventions for how to communicate. S3's is defined by AWS but implemented by many services. GCS is only used by Google (as far as I know).

> a rough description of the algorithm

made some additional comments

@martindurant left a comment:

Some comments on "container", which I think had better be renamed "prefixes", since that's exactly what they are.


This is how the mapping from location to cloud+bucket+prefix+credentials works:

* Using the `location` protocol we find all containers that match the protocol


I don't know what this means

paraseba (Author):

I rewrote the line.


* Using the `location` protocol we find all containers that match the protocol
* We sort the matching containers in order of descending prefix length
* We find the first one that has a `bucket_prefix` that is a prefix of the `location` path (this will be the most specific container that matches)


What do buckets have to do with anything here?
A prefix is a string of any length, no?

paraseba (Author):

The chunk's URL is `s3://foo/bar/baz.nc`; we need to find a container that matches it to know which set of credentials and object store client configuration to use. If there are two containers that work for protocol s3, the one with `bucket_prefix = foo/bar` will match, instead of the one with `bucket_prefix = hello` or the one with `bucket_prefix = foo`.
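A hedged sketch of that longest-prefix resolution, including looking up user-supplied credentials by container name (the container fields and names here are illustrative, not the actual Icechunk types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VirtualChunkContainer:
    name: str
    protocol: str       # e.g. "s3", "gcs"
    bucket_prefix: str  # e.g. "foo/bar"; empty string matches everything

def resolve_container(location: str, containers: list[VirtualChunkContainer],
                      credentials: dict) -> tuple[VirtualChunkContainer, Optional[object]]:
    protocol, _, path = location.partition("://")
    # keep only containers for this protocol, most specific (longest) prefix first
    candidates = sorted((c for c in containers if c.protocol == protocol),
                        key=lambda c: len(c.bucket_prefix), reverse=True)
    for container in candidates:
        if path.startswith(container.bucket_prefix):
            # credentials are keyed by container name at repository construction time
            return container, credentials.get(container.name)
    raise LookupError(f"no virtual chunk container matches {location}")

containers = [
    VirtualChunkContainer(name="s3", protocol="s3", bucket_prefix=""),
    VirtualChunkContainer(name="foo-data", protocol="s3", bucket_prefix="foo/bar"),
]
container, creds = resolve_container("s3://foo/bar/baz.nc", containers,
                                     {"foo-data": {"anonymous": True}})
print(container.name)  # foo-data
```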


"We find the container with the longest prefix matching the given URL"

* We use the rest of the data in the matched container to instantiate an object store client (or retrieve it from cache)
* Credentials for the client must be passed by the user when the repository is instantiated, by indicating the container name to which they apply

We have one default container for each supported object store; the container has protocol and name s3, gcs, tigris, etc., an empty `bucket_prefix`, and no region configured. These default containers, just like any others set in the persistent repo configuration, can be overridden when the repository is instantiated.


So, to be precise, you are envisioning a structure like:

[{"name": "acontainer",
   "prefix": "gcs://bucket/path/prefix",
   "options": {"region": "US-EAST1"}},
 {...}
]

where the "name" handle is what the user passes to change the default?

and "prefix" is literal, it matches any paths that start with that string.

When a region is not indicated and the platform needs it, the logic is:

* For the platform in which the Icechunk repo is stored, we use the same region as the repo
* For all other platforms we use the default region for that platform


Oh, so you ALSO envision default options per protocol? So you have some sort of options merging, from less specific to more specific? It will take some care to allow the user to override at any level.


* Using the `location` protocol we find all containers that match the protocol
* Using the chunk location's protocol, we find all containers that have a matching `protocol` field


It's just the first few characters of the prefix - why separate it out from the prefix in general? Sometimes the prefix might be "s3", sometimes "s3://bucket", sometimes "s3://bucket/path/to". Aside from knowing that these particular examples are all s3 protocol, I don't think that there's any difference in how you should treat them.

paraseba (Author):

I like this. My remaining concern is, how to make it simple for the simple use case. Say, virtual chunks in google storage, repository in AWS. I don't want users to have to create a container for this simple case. That's why I'm proposing to distinguish the platform by the url protocol (by default).


Then you have two:

  • "s3"
  • "gs"
    (and they happen to exactly look like what you would expect as protocols).

You could even have the default default be "", i.e., matches any prefix that's not matched by something more specific.

Successfully merging this pull request may close these issues: Virtual chunks improvements

7 participants