feat(iroh-bytes): Batch blob api #2339

rklaehn · 2024-06-03T14:57:31Z

Description

This adds a new API for creating blobs.

You create a batch. Within a batch you got all the usual operations to add stuff, like add_bytes, add_stream, add_from_path etc. Notable differences to the existing api:

All operations work on individual blobs, so no way to add entire subdirectories with add_from_path
All operations return a TempTag instead of having a set tag option

The way to use the API is to just perform a complex operation within a batch, and then at the end assign a (non temporary) tag to the root(s) of the created data before dropping the batch.

It is possible to scan a directory and create a collection purely on the client side, so the code to traverse a directory can be removed from the node.

To allow the workflow described above, the tags client has been extended to allow manually setting a tag.

Ideally this API would entirely replace the current blobs API, so all read ops would always happen within the context of a batch.

Breaking Changes

At this point mostly adding stuff, but changing the RPC api for setting tags. I might decide to remove the entire non batch mutation API in this PR.

How it works is to leave a streaming RPC call open for each batch, then do operations in the context of an unique identifier for this RPC call.

Notes & open questions

Note: if things work out every add operation refers to a single blob, and the aggregation of many blobs can be driven from the client. That means that a lot of the complexity of the progress events like ids etc. can be removed from the rpc. This still needs to exist, but can be confined to the client.

Todo

~~Add back fine grained progress~~
fine grained progress is not needed for all ops, but definitely for add_file and add_dir. Possibly for add_bytes and add_reader. Not sure if it is OK to have it in all cases despite it typically not being used.
~~Purge all tag setting stuff from the blobs API and the downloader.~~

I would propose that we merge this initially as an addition, and do the stripdown of the other APIs in a subsequent PR. Also, if this is an addition we can do the fine grained progress for add_dir in a subsequent PR as well.

Change checklist

Self-review.
Documentation updates if relevant.
Tests if relevant.
All breaking changes documented.

I found that in most places it is not needed, and if somebody needs them to be clone they can always wrap them in an Arc. Also, this complicates extending the concept of temp tags via the rpc boundary. With this change if will be possible to use the same TempTag type both in the low level blobs store API and in the higher level blobs API of iroh.

rklaehn · 2024-06-04T09:10:03Z

Here is the preliminary API for batches:

pub async fn add_bytes(&self, bytes: impl Into<Bytes>, format: BlobFormat) -> Result<TempTag> {
pub async fn add_file(&self, path: PathBuf, import_mode: ImportMode, format: BlobFormat) -> Result<(TempTag, u64)> {
pub async fn add_dir(&self, root: PathBuf, import_mode: ImportMode, wrap: WrapOption) -> Result<TempTag> {
pub async fn add_collection(&self, collection: Collection) -> Result<TempTag> {
pub async fn add_stream(&self, mut input: impl Stream<Item = io::Result<Bytes>> + Send + Unpin + 'static, format: BlobFormat) -> Result<TempTag> {
pub async fn add_blob_seq(&self, iter: impl Iterator<Item = Bytes>) -> Result<TempTag> {
pub async fn temp_tag(&self, content: HashAndFormat) -> Result<TempTag> {

Basically very similar to the normal blobs api, but there are no options to create tags. Instead every fn returns a temp tag for the thing that has been created, that the user can then later assign to a permanent tag (or not).

The tags API has been extended to allow creating a tag given a hash and format.

Many of these functions are convenience functions. Probably most notably, add_dir is now traversing the file system on the client side and doing multiple add_file calls.

dignifiedquire · 2024-06-04T09:47:52Z

should there be the equivalent delete versions as well`

rklaehn · 2024-06-04T09:59:56Z

should there be the equivalent delete versions as well`

WDYM? You delete stuff by ensuring that it is no longer tagged in some way, then GC will take care of it. There is blob delete, but that is really a low level function that you should rarely use directly.

delete_blob will just do it's thing no matter what temp tags there are, so it can live in the blobs API not in the batch API.

iroh/src/client/blobs.rs

iroh/src/node.rs

- document batches - _with_opts fns to simplify the simple cases

Still exported in the blobs module though.

# Conflicts: # iroh/src/node.rs

also remove some println!

rklaehn · 2024-06-05T11:08:38Z

iroh-blobs/src/store/traits.rs

@@ -700,7 +703,7 @@ pub enum ImportProgress {
 /// does not make any sense. E.g. an in memory implementation will always have
 /// to copy the file into memory. Also, a disk based implementation might choose
 /// to copy small files even if the mode is `Reference`.
-#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Default, Serialize, Deserialize)]
 pub enum ImportMode {


I reexport this from the blobs api since I don't want to newtype it...

iroh/src/client/blobs.rs

rklaehn · 2024-06-05T11:12:24Z

iroh/src/rpc_protocol.rs

@@ -177,6 +208,41 @@ impl ServerStreamingMsg<RpcService> for BlobValidateRequest {
    type Response = ValidateProgress;
 }

+/// Get the status of a blob


this is mostly unrelated to the PR. Just a way to check if you have a thing without downloading it all via the rpc.

move around the BlobStatus, use async lock

iroh/src/client/blobs.rs

…e client side (#2349) ## Description A collection is just one particular way to use a hashseq, so it feels a bit weird to have it baked in to the iroh node. With this we can move some of it into the client. This is a part of #2272 . We can make more similar changes once we have the batch API #2339 . ## Breaking Changes  ## Notes & open questions Note: I closed #2272 because half of the changes in that PR are here, the other half will be part of the batch PR, and moving collections into iroh I am not convinced of yet... ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant. - [x] All breaking changes documented.

# Conflicts: # iroh/src/client/blobs.rs # iroh/src/node/rpc.rs # iroh/src/rpc_protocol.rs

# Conflicts: # iroh/src/client/blobs.rs # iroh/src/client/tags.rs # iroh/src/node/rpc.rs # iroh/src/rpc_protocol.rs

# Conflicts: # iroh/src/client/blobs.rs # iroh/src/client/tags.rs # iroh/src/node.rs # iroh/src/node/builder.rs # iroh/src/node/rpc.rs

dignifiedquire · 2024-06-28T09:42:55Z

@rklaehn what's the state of this?

rklaehn · 2024-06-28T09:56:28Z

@rklaehn what's the state of this?

Just merged with main. I want to do another self-review, but currently trying to keep all the stuff up to date with main so it does not bitrot...

iroh-blobs/src/provider.rs

divagant-martian · 2024-06-28T14:55:27Z

iroh-blobs/src/util.rs

    /// Keep the item alive until the end of the process
    pub fn leak(mut self) {
        // set the liveness tracker to None, so that the refcount is not decreased
        // during drop. This means that the refcount will never reach 0 and the
-        // item will not be gced until the end of the process.
+        // item will not be gced until the end of the process, unless you manually
+        // invoke on_drop.


this reads like it should be moved from a comment to documentation

divagant-martian · 2024-06-28T14:59:38Z

iroh/src/client.rs

-    quic_rpc::RpcClient<RpcService, quic_rpc::transport::boxed::Connection<RpcService>>;
+pub(crate) type RpcConnection = quic_rpc::transport::boxed::Connection<RpcService>;
+
+/// Iroh rpc client - boxed so that we can have a concrete type.


not seeing the boxed here from just looking at the type, maybe the docs could be more clear in that regard?

iroh/src/client/blobs.rs

divagant-martian · 2024-06-28T15:14:56Z

iroh/src/client/tags.rs

+
+    /// Set a tag to a value, overwriting any existing value.
+    ///
+    /// This is a convenience wrapper around `set_opt`.


I was about to suggest [set_opt] to ensure docs always refer an existing method but I can't find it. I assume this is meant to be

Suggested change

/// This is a convenience wrapper around `set_opt`.

/// This is a convenience wrapper around [`set_with_opt`].

divagant-martian · 2024-06-28T15:15:29Z

iroh/src/client/tags.rs

    /// Delete a tag.
+    ///
+    /// This is a convenience wrapper around `set_opt`.


divagant-martian · 2024-06-28T15:55:07Z

iroh/src/node.rs

+            for (content, count) in scope.tags {
+                if let Some(tag_drop) = tag_drop {


inverting this produces the same logic but a single conditional check

divagant-martian · 2024-06-28T15:56:07Z

iroh/src/node.rs

+    }
+
+    /// Remove an entire batch.
+    fn remove(&mut self, batch: BatchId, tag_drop: Option<&dyn TagDrop>) {


leak is always called on store. What happens if remove and remove_one are called without a TagDrop?

dignifiedquire · 2024-07-29T10:58:47Z

closing in favor of #2545

## Description This is the third attempt to add a batch API for adding blobs. Previous one was #2339 The basic idea is the following: all changes to the store happen in the context of a _batch_. All write operations within a batch produce temp tags. These temp tags are scoped to the batch and keep the data alive as long as the batch exists. At some point, the API user has to upgrade one or more temp tags to permanent tags. All non-batch operations would long term be implemented in terms of batch operations. In a second step, the following rpc calls would be replaced by their batch equivalent. - AddStream - AddPath - CreateCollection The third one is very nice, since it means that the notion of a collection (as in a special kind of hashseq) no longer has to even exist in the node code. ## Breaking Changes - iroh::client::blobs::BlobStatus has a new case NotFound - iroh::client::blobs::BlobStatus::Partial: size is now a BaoBlobSize instead of a u64 All other public changes are adding of new APIs. ## Notes & open questions Note: in the previous version I had an optimisation to avoid storing TempTags in the case where there are multiple TempTags with the same hash. I removed this to keep things simple. We can add it back later. ## Change checklist - [ ] Self-review. - [ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [ ] Tests if relevant. - [ ] All breaking changes documented. --------- Co-authored-by: Philipp Krüger <[email protected]>

…e client side (#2349) ## Description A collection is just one particular way to use a hashseq, so it feels a bit weird to have it baked in to the iroh node. With this we can move some of it into the client. This is a part of n0-computer/iroh#2272 . We can make more similar changes once we have the batch API n0-computer/iroh#2339 . ## Breaking Changes  ## Notes & open questions Note: I closed #2272 because half of the changes in that PR are here, the other half will be part of the batch PR, and moving collections into iroh I am not convinced of yet... ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant. - [x] All breaking changes documented.

## Description This is the third attempt to add a batch API for adding blobs. Previous one was n0-computer/iroh#2339 The basic idea is the following: all changes to the store happen in the context of a _batch_. All write operations within a batch produce temp tags. These temp tags are scoped to the batch and keep the data alive as long as the batch exists. At some point, the API user has to upgrade one or more temp tags to permanent tags. All non-batch operations would long term be implemented in terms of batch operations. In a second step, the following rpc calls would be replaced by their batch equivalent. - AddStream - AddPath - CreateCollection The third one is very nice, since it means that the notion of a collection (as in a special kind of hashseq) no longer has to even exist in the node code. ## Breaking Changes - iroh::client::blobs::BlobStatus has a new case NotFound - iroh::client::blobs::BlobStatus::Partial: size is now a BaoBlobSize instead of a u64 All other public changes are adding of new APIs. ## Notes & open questions Note: in the previous version I had an optimisation to avoid storing TempTags in the case where there are multiple TempTags with the same hash. I removed this to keep things simple. We can add it back later. ## Change checklist - [ ] Self-review. - [ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [ ] Tests if relevant. - [ ] All breaking changes documented. --------- Co-authored-by: Philipp Krüger <[email protected]>

…e client side (#2349) ## Description A collection is just one particular way to use a hashseq, so it feels a bit weird to have it baked in to the iroh node. With this we can move some of it into the client. This is a part of n0-computer/iroh#2272 . We can make more similar changes once we have the batch API n0-computer/iroh#2339 . ## Breaking Changes  ## Notes & open questions Note: I closed #2272 because half of the changes in that PR are here, the other half will be part of the batch PR, and moving collections into iroh I am not convinced of yet... ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant. - [x] All breaking changes documented.

## Description This is the third attempt to add a batch API for adding blobs. Previous one was n0-computer/iroh#2339 The basic idea is the following: all changes to the store happen in the context of a _batch_. All write operations within a batch produce temp tags. These temp tags are scoped to the batch and keep the data alive as long as the batch exists. At some point, the API user has to upgrade one or more temp tags to permanent tags. All non-batch operations would long term be implemented in terms of batch operations. In a second step, the following rpc calls would be replaced by their batch equivalent. - AddStream - AddPath - CreateCollection The third one is very nice, since it means that the notion of a collection (as in a special kind of hashseq) no longer has to even exist in the node code. ## Breaking Changes - iroh::client::blobs::BlobStatus has a new case NotFound - iroh::client::blobs::BlobStatus::Partial: size is now a BaoBlobSize instead of a u64 All other public changes are adding of new APIs. ## Notes & open questions Note: in the previous version I had an optimisation to avoid storing TempTags in the case where there are multiple TempTags with the same hash. I removed this to keep things simple. We can add it back later. ## Change checklist - [ ] Self-review. - [ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [ ] Tests if relevant. - [ ] All breaking changes documented. --------- Co-authored-by: Philipp Krüger <[email protected]>

rklaehn added 7 commits June 3, 2024 11:24

refactor: DRY the temp tag creation

a7842d5

refactor: some renaming

573051c

WIP add batch API

b443bec

WIP make batch API work and add some tests

23c90d3

Add ability to add from path in batch api

bf0baea

WIP add more rich tags api

08ccd31

clippy

978587a