Virtual chunks design document #436
base: main
Conversation
@rabernat said:
That's very similar to what I'm proposing here. I'm also proposing a new type of metadata file. But:
So my current thinking is: single
@rabernat said:
My counterexample in the design document is a virtual zarr dataset. Every chunk is a different file. All those files are on the same platform, bucket, etc. But even without being as extreme, this still has the same problem, no? All those prefixes in the files manifest are wasteful. Also, we still need some kind of report on "dependencies", the concept of "containers". This last concept is user-level in the sense that the user will want to provide different credential sets (potentially); not on a per-file basis, but on whatever they consider a container. Users will want to know "what other datasets does this dataset depend on?", and containers are somewhat of a proxy for that.
### Prefixes
Storing the full URL for each chunk duplicates a lot of data in the manifest, making it large. The idea is to try to extract common _prefixes_ and identify them with a short id. Then the manifest only needs to identify the prefix id + maybe an extra string + length + offset.
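To make the idea concrete, here is a minimal sketch of the kind of entry such a manifest could hold; the field names and types are illustrative only, not the actual Icechunk format:

```python
from dataclasses import dataclass

# Illustrative sketch only; these are not Icechunk's real types or field names.
@dataclass
class VirtualChunkRef:
    prefix_id: int  # index into a shared table of URL prefixes
    suffix: str     # remainder of the path, e.g. "c/0/0/1"
    offset: int     # byte offset of the chunk within the object
    length: int     # byte length of the chunk

prefixes = {0: "s3://some-bucket/some-prefix/"}
ref = VirtualChunkRef(prefix_id=0, suffix="c/0/0/1", offset=0, length=131072)
full_url = prefixes[ref.prefix_id] + ref.suffix  # "s3://some-bucket/some-prefix/c/0/0/1"
```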
Does this ever feel familiar to me
Super excited about this work, thanks for sharing the link in the icechunk slack!
## What we need

1. Virtual chunks can be stored in all supported object stores
I know Rich Signell has used Kerchunk on local filesystems. Do you think virtual chunk storage needs to be limited to object stores?
No, it doesn't; eventually we can support filesystems.
Isn't it already supported?
Yeah, we have filesystem support, meant mostly for tests. We won't drop it.
2. Multiple credential sets for virtual chunk resolution
3. Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk
4. Be able to know all the "places" pointed to by a repo, without having to load every manifest in every version
5. Object-level change detection (not serving a virtual chunk if the containing file changed)
I wonder if a more specific requirement here would be helpful. What does "not serve" mean? Is it an error, warning, or configurable behavior?
A runtime error when the object has changed, instead of getting invalid data.
Should the Runtime Error be thrown at data-access time or store/session-opening time?
It's not practical to do this at store-opening time; that would mean eagerly checking the ETag / modification time of every virtual file.
I changed the description to make it more clear.
* The `last_modified` field contains an optional unix epoch in seconds. If set for a chunk, retrieval will fail if the file was modified after this time. We use this instead of an ETag or checksum to save space on the manifests. `u32` should be enough until year ~ 2100.
Appreciating that using an optional unix epoch saves on space, what do you expect the impact of choosing unix epoch over ETag or checksum to be on read or write time?
I don't expect any performance impact on reads. The object store will retrieve the object only if the "didn't change" conditions are met, without any observable latency difference. On writes, it's a different story; we need to discuss with @TomNicholas whether the new proposal is something we can implement in VirtualiZarr and what the impact could be.
In any case, I'm making the checksum optional at the chunk level.
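For reference, a rough sketch of what that read-time check could look like against S3 with boto3 (bucket, key, byte range, and timestamp are placeholders); the conditional request fails with 412 instead of returning possibly stale bytes:

```python
from datetime import datetime, timezone
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
last_modified = datetime(2024, 1, 1, tzinfo=timezone.utc)  # value stored in the manifest

try:
    # The object store returns the bytes only if the object has not been
    # modified after `last_modified`; otherwise the request fails with 412.
    resp = s3.get_object(
        Bucket="some-bucket",
        Key="some-prefix/some-file.nc",
        Range="bytes=0-131071",           # virtual chunks are byte ranges
        IfUnmodifiedSince=last_modified,
    )
    data = resp["Body"].read()
except ClientError as err:
    if err.response["Error"]["Code"] == "PreconditionFailed":
        raise RuntimeError("virtual chunk source was modified after it was referenced") from err
    raise
```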
# Improvements to virtual chunks handling

## What we need
I understand that this design document pertains mostly to the `VirtualChunkContainer`, so I'm not sure if these are in scope, but there are some interactions at the store level that may be worth mentioning. For example,
- Does not provide a barrier to distributed writes of virtual chunks (xref Build virtual Zarr store using Xarray's dataset.to_zarr(region="...") model zarr-developers/VirtualiZarr#308)
- Does not provide a barrier to combining virtual chunks of variable lengths in a single store (xref Support variable-length chunks zarr-developers/VirtualiZarr#12)
- Does not provide a barrier to combining virtual chunks with "inlined" data in a single store (xref "Inlining" data when writing references to disk zarr-developers/VirtualiZarr#62)
6. > Object-level change detection (not serving a virtual chunk if the containing file changed)
We optionally store the last-modified timestamp of the object containing each chunk. This is not optimal; a hash would be better, but it takes more space. Storing it per chunk is also not optimal, since many chunks come from the same object, but we don't have another easy place to store it when containers include more than one object. We could add more configuration to optimize this in the future.
> we don't have other easy place to store it if they are using containers that include more than one object

Both the redundant prefix and redundant timestamp issues seem similar in some ways to the redundant STAC information solved by hydration/dehydration in pgstac. I'm curious if there's anything to be learned from the benefits/downsides of that approach.
I dropped the requirement to do this optimization at this time. I think we can make it comfortably to 1.0 without it. In any case, this was an in-memory space optimization that we can add later.
* Example: `s3://some-bucket/some-prefix/c/0/0/1` is pointed to by only 1 chunk. But there are many objects like it.
* There could be tens of millions of chunks (so different URLs)

### Prefixes
On this whole topic I would comment:
- compression algorithms are pretty good at this: if you have a bunch of long strings with some prefix agreement between them, it will not result in extra bytes in storage
- parquet supports prefix-delta encoding (https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-strings-delta_byte_array--7)
- if you really want to go down the road of constructing strings, only make them on demand, else you fill up RAM anyway. I found that templating is surprisingly slow (with python strings)
We are currently not optimizing the on-disk format. This was meant as an in-memory optimization; Icechunk keeps full manifests in memory. But in my latest commit, I'm dropping this requirement for now, and we'll evaluate better ways to optimize the in-memory data structure independently of the on-disk format.
The advantage of per-snapshot prefixes is that we don't need to load/parse all the prefixes if the current snapshot uses only a few. A case for this would be a virtual dataset that has been materialized in a newer version. If prefixes are per-snapshot, we wouldn't need to load any prefixes.

On the other hand, requirement 4 asks for the full list of "dependencies" for the dataset, and that means for all versions, which requires a global list of prefixes. Without a global list, we would need to fetch the list of prefixes of each snapshot.
Parquet implements min/max on any column for each partition, so you know from the file metadata alone what they contain, without loading all the data.
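For illustration only (Icechunk manifests are not Parquet today), row-group statistics can be read from the footer alone with pyarrow; the file name and column index here are placeholders:

```python
import pyarrow.parquet as pq

# Hypothetical manifest stored as Parquet; "manifest.parquet" is a placeholder.
md = pq.ParquetFile("manifest.parquet").metadata

for rg in range(md.num_row_groups):
    stats = md.row_group(rg).column(0).statistics  # e.g. a chunk "location" column
    if stats is not None and stats.has_min_max:
        # min/max come from the footer metadata only; no data pages are read
        print(rg, stats.min, stats.max)
```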
### Supporting multiple clouds

When the user adds a virtual ref, they need to indicate the details of the platform to retrieve it. These include what cloud it is (so we know how to set up credentials), region, endpoint URL, port, and other settings. It looks a lot like our `S3Config` type (or an expanded version of that).
Credentials and config will not be the same for any given user. I don't think you can reasonably encode them for general use. As far as S3 goes, the only fields I think would be up for consideration are: anonymous, endpoint, and region (the latter debatable). Of these, only anonymous is relevant for gcs/azure.
Absolutely, I think things are clearer in my latest commit. Credentials are an "instance construction" concern, not a write-time concern. Different users will instantiate the repository with different credential sets. We just need a way to tell them what they need to provide, and a way for them to provide it.
platform: Platform, // something like S3
url_template: String, // something like some-bucket/some-prefix/output/{}/{}.nc
default_template_arguments: Vec<String>,
cloud_client_config: Map<String, Any>, // something like {endpoint_url: 8080}
The endpoint being different for some S3 than other S3 is totally a use case.
Consider a kerchunked original dataset on a public AWS S3, and additional updates on my own minio.
Yes, you can have different containers, with different endpoints.
* The `last_modified` field contains an optional unix epoch in seconds. If set for a chunk, retrieval will fail if the file was modified after this time. We use this instead of an ETag or checksum to save space on the manifests. `u32` should be enough until year ~ 2100.
But actually you are after a (non-cryptographic) hash. Relying on times seems dangerous, even if we assume everything is UTC.
It's debatable, I think. A few thoughts:
- We did hashes in Arraylake and they didn't get a lot of use; they are expensive to compute, for example.
- Timezones are not a concern here; object stores have a well defined and standardized format for their modified-time headers.
- Doing `last_modified` is very cheap and easy at write time: for example, if you pass `now()` it will work most of the time. Etag you generally need to retrieve from the object store for each object. Imagine virtualizing a zarr store: you need to fetch the etag for every chunk.
- Last modified takes less space

In any case, this is very useful feedback. In my last commit I made it possible to provide timestamp, etag, or nothing.
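A small sketch of the two options at reference-creation time (bucket and key are placeholders): inventing a timestamp is one local call, while an exact ETag costs a HEAD request per referenced object.

```python
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")

# Cheap: record "now" as last_modified; valid as long as the source object
# is not rewritten after the reference is created.
last_modified = datetime.now(timezone.utc)

# Exact: one HEAD request per referenced object to obtain its ETag.
head = s3.head_object(Bucket="some-bucket", Key="some-prefix/some-file.nc")
etag = head["ETag"]
```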
Now the user initializes a Repo instance by passing, if needed, a credential set for each virtual chunk container. If a chunk is read for which there are no credentials, we default to a common set of credentials for virtual chunks, and ultimately, to the Storage credentials if on the same platform.

3. > Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk

For large prefixes, the template + arguments creates some savings. For example, a template for virtual zarr like `some-long-bucket-name/with-some-even-longer-prefix/and-more-nesting/some-long-array-name-for-the-dataset/c/{}/{}/{}`.
Again, I need a lot of convincing on this in the presence of decent compression, say Zstd.
import cramjam

strings = "".join([f"some-long-bucket-name/with-some-even-longer-prefix/and-more-nesting/some-long-array-name-for-the-dataset/c/{i}/{i}/{i}" for i in range(10000)])
out = cramjam.zstd.compress(strings.encode())
len(strings)  # 1206670
len(out)      # 19052, 1.9 bytes per entry
Remember, this was an optimization for the in-memory data structure. Coupling it with the on-disk format was a mistake (a consequence of Icechunk still not having a separation layer there). But even so, this is not the right way to optimize this. I dropped the requirement; we can absolutely do this optimization later. Thanks for the feedback on this, it changed my mind.
6. > Object-level change detection (not serving a virtual chunk if the containing file changed)

We optionally store the last-modified timestamp of the object containing each chunk. This is not optimal; a hash would be better, but it takes more space. Storing it per chunk is also not optimal, since many chunks come from the same object, but we don't have another easy place to store it when containers include more than one object. We could add more configuration to optimize this in the future.
> but it takes more space

Take as many bytes as you feel necessary. It doesn't need to be a hash of the data, but a unique value from the store. And these are stored *per file*, whereas there may be one or many chunks per file.
Seba, this is a really thoughtful and useful document for exploring this very tricky topic. Thanks so much for taking the time to write it out.
I think the proposal you have laid out will definitely work. I also think there is some room to clarify the requirements (what are we really optimizing for?) and possibly scope them down a bit.
The crux for me is defining the actual data model we will use, and your structs are a very useful way to do that. The key question appears to be: what exactly is a "place" (cf. req. 4)? Right now, I am concerned that VirtualChunkContainer couples together two separate requirements (4 and 3, "optimized manifests") unnecessarily.
I personally like my idea of storing the config right in the branch / tag file. 😁
I also still think it is worth considering storing each virtual file in a manifest of its own, and then referencing this from the chunk manifest.
## What we need

1. Virtual chunks can be stored in all supported object stores
2. Multiple credential sets for virtual chunk resolution
Can we sharpen this requirement a bit? It is currently an incomplete sentence and is thus less prescriptive than 1. Maybe there are actually two separate requirements here?
2. Multiple credentials sets for virtual chunk resolution
2. Virtual chunks within a single repo can be stored in multiple different object storage buckets in different platforms
2.5. Users can specify different credentials for different object storage locations
Having written this down, it seems like 2 is a really broad and tricky requirement to satisfy. We could consider narrowing it to the following.
2. Virtual chunks within a repo can be stored in _one single object storage bucket in a specific cloud platform_. Mixing virtual references from multiple buckets in a single repo is not allowed.
I'm wondering if that would be too restrictive. It would certainly simplify the credentials story a lot.
@TomNicholas - can you think of scenarios where you would want to mix multiple different buckets in a single repo?
I think it would be useful to create virtual stores that would allow people to open a data tree containing both raw data and overviews. The raw data would commonly be in a separate bucket from the overviews.
Can you elaborate? Why? Could the overview be native writes to the repo?
> Can you elaborate? Why? Could the overview be native writes to the repo?

To be fair, I'm not sure how common this is, but the primary case in which the overviews cannot be native writes to the repo is when they already exist as Zarr V2 data.
A more common scenario would be continuously adding NetCDF references as virtual chunks along with continuously adding Zarr overviews. My understanding from previous conversations was that once native data is committed to a repository, all new commits will need to be native rather than virtual. Is that incorrect? Could one continuously add virtual chunks for the full resolution data and native chunks for the overviews?
> Could one continuously add virtual chunks for the full resolution data and native chunks for the overviews?

Icechunk does not place restrictions on which chunks are native and which are virtual.
> Icechunk does not place restrictions on which chunks are native and which are virtual

Yes, you should already be able to create "mixed" icechunk stores using xarray and virtualizarr. Aimee's `append_dim` PR goes even further because it is one way to create a mixed array, where you could append native chunks to an array containing virtual chunks (see zarr-developers/VirtualiZarr#272 (comment)).

> @TomNicholas - can you think of scenarios where you would want to mix multiple different buckets in a single repo?

If you wanted to use virtual references to create some kind of aggregated dataset, e.g. put a bunch of disparate model run datasets that together comprise CMIP6 into one virtual store. Those model runs could be in different buckets (or even clouds...) from different data providers. This kind of scenario is a little more likely to occur across groups in a store rather than within a single group/array IMO: think a DataTree where each group came from a different source. I think it's debatable whether that use case should be targeted by some kind of higher-level catalog instead though.
> a DataTree where each group came from a different source. I think it's debatable whether that use case should be targeted by some kind of higher-level catalog instead though.

I would say that absolutely this looks like a single dataset object.
It's a good question: how deep of a hierarchy should we expect to store in a single Icechunk repo? Could we put all of CMIP6 in a single repo?
This is a problem we have also wrestled with in Arraylake.
If nothing in Icechunk scales as O(N), where N is the number of arrays / groups / chunks / references / etc., then it is possible.
However, the current design doesn't scale well to this case because all of the metadata for the whole store goes in one single snapshot file.
Yes, there is an extremely thin line between "catalog" and "manifest" with arbitrary trees.
I think things look much better in the latest commit. Please take a look and tell me what you think. The solution seems much simpler, and still allows all kinds of mixes of object stores, prefixes, credentials, etc.
1. Virtual chunks can be stored in all supported object stores
2. Multiple credential sets for virtual chunk resolution
3. Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk
3. Optimized manifests where the same prefix doesn't need to be repeated for every virtual chunk
3. Efficient storage of chunk references, which often contain highly redundant prefixes
Also very important to clarify what is meant by "efficient" or "optimized" -- specifically whether we are trying to optimize for on-disk storage vs. in-memory representation vs. lookup time, or something else.
Good feedback, I was too close to the problem and failed to add this context. This requirement has been dropped for now; it's an in-memory optimization that we can do at a later time, if needed.
In the first URL style above, the prefix could be the full file, `s3://some-bucket/some-prefix/some-file.nc`. In the second example, there are options for the prefix: it could be `s3://some-bucket/some-prefix/c/0/0/1`, or `s3://some-bucket/some-prefix/c/0/0`, or `s3://some-bucket/some-prefix/c/0`, or `s3://some-bucket/some-prefix/c`, with different numbers of chunks under it in each case.

Intelligence is needed to optimize what is considered a good prefix. If done badly, we could have:
Is intelligence really needed? Or would an off-the-shelf compression algorithm handle it just fine?
I decided to test this with my favorite little storage library
import zarr
import numpy as np

prefix = 's3://some-bucket/some-prefix/c/0/'
paths = np.array([str(i) for i in range(10000)])
full_paths = np.array([prefix + p for p in paths])

store = zarr.MemoryStore()
group = zarr.group(store=store)

path_array = group.create("paths", shape=len(paths), chunks=len(paths), dtype=str, compressor=zarr.Zstd())
path_array[:] = paths

full_path_array = group.create("full_paths", shape=len(paths), chunks=len(paths), dtype=str, compressor=zarr.Zstd())
full_path_array[:] = full_paths

print(path_array.nbytes_stored)
print(full_path_array.nbytes_stored)
# 6579
# 7048
So storing the full paths and then using a zstd compressor on the raw bytes creates only 7% more data than just storing the last part of the path.
This is what compression algorithms do: find redundant patterns in data. Should we really spend time trying to implement our own compression scheme for prefixes?
This is my failure to clarify that we were looking for an in-memory optimization. Currently Icechunk lacks a layer of separation between the in-memory and on-disk data structures (that will have to come soon), which made things even more confusing.
The advantage of per-snapshot prefixes is that we don't need to load/parse all the prefixes if the current snapshot uses only a few. A case for this would be a virtual dataset that has been materialized in a newer version. If prefixes are per-snapshot, we wouldn't need to load any prefixes.

On the other hand, requirement 4 asks for the full list of "dependencies" for the dataset, and that means for all versions, which requires a global list of prefixes. Without a global list, we would need to fetch the list of prefixes of each snapshot.
I think we should sharpen the requirement about what exactly these dependencies are. Does the prefix have to be part of it? That feels like a bit of a separate concern. The point of the dependencies is to know what credentials will be required and where on the internet the virtual chunks might live. For these purposes, I'd suggest it's enough to know
- The platform (e.g. S3)
- The bucket name
- [Optional] The region
- [Optional] The endpoint URL
I think this is much more clear in my latest commit.
### Who creates/decides the prefixes

Obviously, generating the list of prefixes given the list of virtual chunks is doable automatically. But it's a tricky optimization problem. One strategy could be to generate a trie with all chunk URLs, and then use some heuristic to select prefixes using heavily populated nodes. It's tricky and potentially computationally expensive for large repos.
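As a rough illustration of the kind of heuristic that paragraph describes (not a proposed implementation), one could count how many chunk URLs fall under each candidate prefix and keep the well-populated ones:

```python
from collections import Counter

def candidate_prefixes(urls, min_chunks=1000):
    """Sketch of a prefix-selection heuristic: count the chunks under each
    directory-like prefix and return those covering at least `min_chunks`
    chunks, most specific first."""
    counts = Counter()
    for url in urls:
        parts = url.rstrip("/").split("/")
        # depth 3 keeps "scheme://bucket"; deeper values add path components
        for depth in range(3, len(parts)):
            counts["/".join(parts[:depth])] += 1
    keep = [p for p, n in counts.items() if n >= min_chunks]
    keep.sort(key=len, reverse=True)  # prefer the most specific prefixes
    return keep

# Example: 2000 synthetic chunk URLs under one bucket prefix.
candidate_prefixes(
    [f"s3://some-bucket/some-prefix/c/0/{i}/{j}" for i in range(10) for j in range(200)],
    min_chunks=500,
)
```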
This is where simply defining prefix as `{protocol}://{bucket}` might save us a lot of trouble, and then we let a standard compression scheme handle the rest.
* Users can add to the list and edit members but not delete.
But does garbage collection potentially do this?
I don't think we can meaningfully distinguish between "users" and "admins" of a repo. Currently the only way to update a repo is to make a commit. If we start making out-of-band ways to change a repo, don't we open ourselves up to consistency problems?
* The list of chunk containers is stored in a new global `Config` object. There is a single configuration per repo, and all reads and writes are done reading the latest configuration. In the initial implementation, the "history" of configuration is present only for reference and user inspection by hand.

TBD: how to store this information. One approach would be using the new PUT if-match S3 functionality, but this would be the first place where we overwrite objects.
Can you update Config and commit to a branch at the same time? What if a branch commit succeeds but Config update fails? Or vice versa? S3 makes no promises about transactions across multiple keys...that's what we have icechunk for.
This whole concept feels like a very slippery slope of nasty edge cases.
However, I see it may be necessary if we want to expose information that transcends an individual snapshot.
Here's an alternative approach: we could store the Config directly inside the branch or tag file as JSON. Obviously this does not scale to large numbers of VirtualChunkContainers. But it would have these benefits:
- Does not increase the number of GET requests to open a store
- Guaranteed consistency between commit and Config
The downside is that, to gather all of the dependencies for a repo, we have to read all of the tag and branch files (note: we never have to read the branch history, only ever the latest branch file). This number will probably be small for most repos.
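Purely as a sketch of the shape this alternative could take (field names here are hypothetical, not Icechunk's actual branch-file format), the branch file would carry the config alongside the snapshot pointer:

```python
# Hypothetical branch file contents, shown as a Python dict; not the real on-disk format.
branch_file = {
    "snapshot_id": "some-snapshot-id",  # whatever snapshot the branch currently points at
    "config": {
        "inline_chunk_threshold_bytes": 512,
        "virtual_chunk_containers": [
            {"name": "s3", "bucket_prefix": "some-bucket/some-prefix"},
        ],
    },
}
```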
struct Config {
    inline_chunk_threshold_bytes: u16,
    virtual_chunk_containers: Vec<VirtualChunkContainer>,
}
In general, I really like the idea of repo-level config stored with the repo. Iceberg has this and it seems really useful. Could eventually put all kinds of stuff here.
* The `last_modified` field contains an optional unix epoch in seconds. If set for a chunk, retrieval will fail if the file was modified after this time. We use this instead of an ETag or checksum to save space on the manifests. `u32` should be enough until year ~ 2100.
This feels like an unnecessary space optimization. ETag is clearly meant for this purpose. ETag has several advantages over timestamp. In particular, if the file is overwritten with the same data, the ETag doesn't change. Also, no timezone headaches to worry about.
At least we could make this a configuration choice and leave it up to the user.
> overwritten with the same data, the ETag doesn't change

This is only true (on S3) with a single PUT. For multipart uploads, the chunk sizes need to match too, and maybe something else too.
In the latest commit, I'm allowing `last_modified` and etag (or nothing). What I REALLY like about `last_modified` is that you can "invent" it; you don't really need to fetch every etag for every object from the store. If you just pass `now` as `last_modified`, it will work in most cases.
### Reading virtual chunks

* When a user creates a repository instance, they can specify credentials for each `VirtualChunkContainer`, pointing to them by name.
I see, `name` is used to associate credentials with containers.
Do we have a feeling for how many `VirtualChunkContainer`s we expect a typical repo to have?
Mixing the prefix into the `VirtualChunkContainer` will complicate this quite a bit. Credentials will generally be scoped to buckets, not prefixes. So if there are many `VirtualChunkContainer`s with the same bucket, they would have to redundantly specify credentials for each one.
This problem goes away if you basically make each `VirtualChunkContainer` 1:1 with a set of credentials.
> Credentials will generally be scoped to buckets, not prefixes.
All backends support intra-bucket permissioning. In addition, you might have the same bucket name at different endpoints.
Right, allowing more than one set of credentials for different paths of a bucket was part of the design (unfortunately as an unlisted requirement now that I think about it). It's also possible in my new commit.
* Dropped manifest optimization goal
* Dropped the performance goal for fetching all required virtual containers
* Much simplified and optional mechanism for defining virtual containers
* Simplified interaction with VirtualiZarr
* Added optional Etag
* More specifics about `Config`
* More details on the algorithm for virtual chunk location resolution
* s3://
* gcs://
* tigris:// (is this the right protocol?)
No, it's just S3 with a custom endpoint URL.
There are also credentials... We may need to know it's Tigris so we manage the generation of credentials differently from, say, localhost minio. Of course, we can always play games with the endpoint and figure it out, but this seems more future-proof?
> we manage the generation of credentials differently from say localhost minio

I recommend you keep to using ("s3") for everything that indeed matches that protocol, and the rest is in the configuration for the individual "containers" - isn't that the whole point? I don't think you want to maintain a potentially endless list of protocols for each and every service that might have some icy data.

> generation of credentials

What does this mean?
By the way, how exactly are chunk URLs mapped to "containers", i.e., how do you know which container a particular URL belongs to?
One downside of using `s3://` everywhere is that it forces users to actually create the VirtualChunkContainers, even for some very common object stores like Google's. Using the protocol would be a bit easier on them for the very common case of a simple repo on Google or Tigris. But I take your point: which ones are s3:// and which ones are a protocol is arbitrary. Will think more about it.
> how exactly are chunk URLs mapped to "containers"

@martindurant a rough description of the algorithm is documented here
> One downside of using s3:// everywhere is that it forces users to actually create the VirtualChunkContainers, even for some very common object stores like Google's.

?? Google's storage does not use the s3 protocol. When I say "protocol", I mean exactly that: a set of rules and conventions for how to communicate. S3's is defined by AWS but implemented by many services. GCS is only used by Google (as far as I know).

> a rough description of the algorithm

made some additional comments
Some comments on "containers", which I think had better be renamed "prefixes", since that's exactly what they are.
This is how the mapping from location to cloud+bucket+prefix+credentials works:

* Using the `location` protocol we find all containers that match the protocol
I don't know what this means
I rewrote the line.
* Using the `location` protocol we find all containers that match the protocol
* We sort the matching containers in order of descending prefix length
* We find the first one that has a `bucket_prefix` that is a prefix of the `location` path (this will be the most specific container that matches)
What do buckets have to do with anything here?
Prefix is a string of any length, no?
The chunk's URL is `s3://foo/bar/baz.nc`; we need to find a container that matches it to know what set of credentials and object store client configuration to use. If there are two containers that work for protocol s3, the one with `bucket_prefix = foo/bar` will match, instead of the one with `bucket_prefix = hello` or the one with `bucket_prefix = foo`.
"We find the container with the longest prefix matching the given URL"
* We use the rest of the data in the matched container to instantiate an object store client (or retrieve it from cache)
* Credentials for the client must be passed by the user when the repository is instantiated, by indicating the container name to which they apply

We have one default container for each supported object store; the container has protocol and name s3, gcs, tigris, etc., an empty `bucket_prefix`, and no region configured. These default containers, just like any others set in the persistent repo configuration, can be overridden when the repository is instantiated.
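A compact sketch of the resolution order described above (container names and prefixes are made up; default containers have an empty `bucket_prefix`):

```python
# Illustrative only; not the Icechunk implementation or its real container schema.
containers = [
    {"name": "s3",      "protocol": "s3",  "bucket_prefix": ""},        # default container
    {"name": "updates", "protocol": "s3",  "bucket_prefix": "foo/bar"},
    {"name": "gcs",     "protocol": "gcs", "bucket_prefix": ""},        # default container
]

def resolve_container(location: str):
    protocol, _, path = location.partition("://")
    # keep containers for this protocol, most specific (longest) prefix first
    matching = sorted(
        (c for c in containers if c["protocol"] == protocol),
        key=lambda c: len(c["bucket_prefix"]),
        reverse=True,
    )
    for c in matching:
        if path.startswith(c["bucket_prefix"]):
            return c
    raise ValueError(f"no virtual chunk container matches {location}")

resolve_container("s3://foo/bar/baz.nc")   # -> the "updates" container
resolve_container("s3://other-bucket/x")   # -> the default "s3" container
```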
So, to be precise, you are envisioning a structure like:

[{"name": "acontainer",
  "prefix": "gcs://bucket/path/prefix",
  "options": {"region": "US-EAST1"}},
 {...}
]

where the "name" handle is what the user passes to change the default?
And "prefix" is literal: it matches any paths that start with that string.
When a region is not indicated and the platform needs it, the logic is:

* For the platform in which the Icechunk repo is stored, we use the same region as the repo
* For all other platforms, we use the default region for that platform
Oh, so you ALSO envision default options per protocol? So you have some sort of options merging, from less specific to more specific? It will take some care to allow the user to override at any level.
* Using the `location` protocol we find all containers that match the protocol
* Using the chunk location's protocol, we find all containers that have a matching `protocol` field
It's just the first few characters of the prefix - why separate it out from the prefix in general? Sometimes the prefix might be "s3", sometimes "s3://bucket", sometimes "s3://bucket/path/to". Aside from knowing that these particular examples are all s3 protocol, I don't think that there's any difference in how you should treat them.
I like this. My remaining concern is how to make it simple for the simple use case: say, virtual chunks in Google storage, repository in AWS. I don't want users to have to create a container for this simple case. That's why I'm proposing to distinguish the platform by the URL protocol (by default).
Then you have two:
- "s3"
- "gs"
(and they happen to exactly look like what you would expect as protocols).
You could even have the default default be "", i.e., matches any prefix that's not matched by something more specific.
These are early thoughts; feel free to provide feedback and different ideas. Also, particularly important: am I missing requirements?
Closes: #430