feat(iroh-bytes): Batch blob api #2339
Conversation
I found that in most places it is not needed, and if somebody needs them to be `Clone` they can always wrap them in an `Arc`. Also, this complicates extending the concept of temp tags across the RPC boundary. With this change it will be possible to use the same `TempTag` type both in the low-level blobs store API and in the higher-level blobs API of iroh.
Here is the preliminary API for batches:

```rust
pub async fn add_bytes(&self, bytes: impl Into<Bytes>, format: BlobFormat) -> Result<TempTag>
pub async fn add_file(&self, path: PathBuf, import_mode: ImportMode, format: BlobFormat) -> Result<(TempTag, u64)>
pub async fn add_dir(&self, root: PathBuf, import_mode: ImportMode, wrap: WrapOption) -> Result<TempTag>
pub async fn add_collection(&self, collection: Collection) -> Result<TempTag>
pub async fn add_stream(&self, input: impl Stream<Item = io::Result<Bytes>> + Send + Unpin + 'static, format: BlobFormat) -> Result<TempTag>
pub async fn add_blob_seq(&self, iter: impl Iterator<Item = Bytes>) -> Result<TempTag>
pub async fn temp_tag(&self, content: HashAndFormat) -> Result<TempTag>
```

Basically very similar to the normal blobs API, but there are no options to create tags. Instead, every fn returns a temp tag for the thing that has been created, which the user can then later assign to a permanent tag (or not). The tags API has been extended to allow creating a tag given a hash and format. Many of these functions are convenience functions. Probably most notably, add_dir now traverses the file system on the client side and makes multiple add_file calls.
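The flow these signatures imply (write ops yield temp tags, which the user later upgrades to permanent tags) can be sketched with a self-contained mock. Everything below (`Node`, `tags_set`, the length-based hash) is an illustrative stand-in, not the real iroh API.

```rust
use std::collections::HashMap;

// Stand-in for a content hash (the real store uses blake3).
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Hash(String);

// Stand-in temp tag: holds the hash of data the batch keeps alive.
struct TempTag {
    hash: Hash,
}

#[derive(Default)]
struct Node {
    blobs: HashMap<Hash, Vec<u8>>,
    tags: HashMap<String, Hash>,
}

impl Node {
    // Every write op returns a TempTag instead of creating a permanent tag.
    fn add_bytes(&mut self, bytes: &[u8]) -> TempTag {
        let hash = Hash(format!("len-{}", bytes.len())); // illustrative only
        self.blobs.insert(hash.clone(), bytes.to_vec());
        TempTag { hash }
    }

    // The extended tags API: create a permanent tag from a hash.
    fn tags_set(&mut self, name: &str, hash: Hash) {
        self.tags.insert(name.to_string(), hash);
    }
}

fn main() {
    let mut node = Node::default();
    // Within the batch: add data and collect temp tags...
    let tt = node.add_bytes(b"hello");
    // ...then upgrade the root to a permanent tag before the batch drops.
    node.tags_set("my-data", tt.hash.clone());
    assert!(node.tags.contains_key("my-data"));
}
```

The point of the mock is only the shape of the workflow: no tag options on the write path, and one explicit tag assignment at the end.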
should there be the equivalent
WDYM? You delete stuff by ensuring that it is no longer tagged in some way, then GC will take care of it. There is blob delete, but that is really a low-level function that you should rarely use directly. delete_blob will just do its thing no matter what temp tags there are, so it can live in the blobs API, not in the batch API.
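The deletion model described here (nothing is deleted directly; GC reclaims whatever no tag references) can be sketched with stand-in types; `gc` and the string hashes below are illustrative, not iroh's actual implementation.

```rust
use std::collections::{HashMap, HashSet};

// A blob survives GC only while some tag still points at its hash.
fn gc(blobs: &mut HashMap<String, Vec<u8>>, tags: &HashMap<String, String>) {
    let live: HashSet<&String> = tags.values().collect();
    blobs.retain(|hash, _| live.contains(hash));
}

fn main() {
    let mut blobs = HashMap::new();
    blobs.insert("h1".to_string(), b"keep".to_vec());
    blobs.insert("h2".to_string(), b"orphan".to_vec());

    let mut tags = HashMap::new();
    tags.insert("my-tag".to_string(), "h1".to_string());

    // "Deleting" h2 just means no tag references it; GC reclaims it.
    gc(&mut blobs, &tags);
    assert!(blobs.contains_key("h1"));
    assert!(!blobs.contains_key("h2"));
}
```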
- document batches
- _with_opts fns to simplify the simple cases
Still exported in the blobs module though.
# Conflicts:
#	iroh/src/node.rs
also remove some println!
```diff
@@ -700,7 +703,7 @@ pub enum ImportProgress {
 /// does not make any sense. E.g. an in memory implementation will always have
 /// to copy the file into memory. Also, a disk based implementation might choose
 /// to copy small files even if the mode is `Reference`.
-#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Default, Serialize, Deserialize)]
 pub enum ImportMode {
```
I reexport this from the blobs api since I don't want to newtype it...
```diff
@@ -177,6 +208,41 @@ impl ServerStreamingMsg<RpcService> for BlobValidateRequest {
     type Response = ValidateProgress;
 }

+/// Get the status of a blob
```
this is mostly unrelated to the PR. Just a way to check if you have a thing without downloading it all via the rpc.
move around the BlobStatus, use async lock
…e client side (#2349)

## Description

A collection is just one particular way to use a hashseq, so it feels a bit weird to have it baked in to the iroh node. With this we can move some of it into the client. This is a part of #2272. We can make more similar changes once we have the batch API #2339.

## Breaking Changes

<!-- Optional, if there are any breaking changes document them, including how to migrate older code. -->

## Notes & open questions

Note: I closed #2272 because half of the changes in that PR are here, the other half will be part of the batch PR, and moving collections into iroh I am not convinced of yet...

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
# Conflicts:
#	iroh/src/client/blobs.rs
#	iroh/src/node/rpc.rs
#	iroh/src/rpc_protocol.rs
b13a1e9 to b8b0d80
# Conflicts:
#	iroh/src/client/blobs.rs
#	iroh/src/client/tags.rs
#	iroh/src/node/rpc.rs
#	iroh/src/rpc_protocol.rs
# Conflicts:
#	iroh/src/client/blobs.rs
#	iroh/src/client/tags.rs
#	iroh/src/node.rs
#	iroh/src/node/builder.rs
#	iroh/src/node/rpc.rs
@rklaehn what's the state of this?
Just merged with main. I want to do another self-review, but currently trying to keep all the stuff up to date with main so it does not bitrot...
```diff
     /// Keep the item alive until the end of the process
     pub fn leak(mut self) {
         // set the liveness tracker to None, so that the refcount is not decreased
         // during drop. This means that the refcount will never reach 0 and the
-        // item will not be gced until the end of the process.
+        // item will not be gced until the end of the process, unless you manually
+        // invoke on_drop.
```
this reads like it should be moved from a comment to documentation
```rust
    quic_rpc::RpcClient<RpcService, quic_rpc::transport::boxed::Connection<RpcService>>;
pub(crate) type RpcConnection = quic_rpc::transport::boxed::Connection<RpcService>;

/// Iroh rpc client - boxed so that we can have a concrete type.
```
I'm not seeing the boxing here from just looking at the type; maybe the docs could be clearer in that regard?
```rust
/// Set a tag to a value, overwriting any existing value.
///
/// This is a convenience wrapper around `set_opt`.
```
I was about to suggest [`set_opt`] to ensure docs always refer to an existing method, but I can't find it. I assume this is meant to be:

```diff
-    /// This is a convenience wrapper around `set_opt`.
+    /// This is a convenience wrapper around [`set_with_opt`].
```
```rust
/// Delete a tag.
///
/// This is a convenience wrapper around `set_opt`.
```
same.
```rust
for (content, count) in scope.tags {
    if let Some(tag_drop) = tag_drop {
```
inverting this produces the same logic but a single conditional check
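The suggested inversion, hoisting the `Option` check out of the loop so it runs once instead of per iteration, can be sketched with simplified stand-in types (`TagDrop`, `remove_batch`, and `Counter` below are illustrative, not the real iroh definitions):

```rust
use std::cell::Cell;

// Simplified stand-in for the store callback that releases a tag.
trait TagDrop {
    fn on_drop(&self, content: &str);
}

// Test double that counts how many tags were released.
struct Counter(Cell<usize>);
impl TagDrop for Counter {
    fn on_drop(&self, _content: &str) {
        self.0.set(self.0.get() + 1);
    }
}

// Inverted form: one conditional check for the whole batch instead of
// one check per (content, count) entry.
fn remove_batch(tags: Vec<(String, u64)>, tag_drop: Option<&dyn TagDrop>) {
    if let Some(tag_drop) = tag_drop {
        for (content, count) in tags {
            for _ in 0..count {
                tag_drop.on_drop(&content);
            }
        }
    }
}

fn main() {
    let counter = Counter(Cell::new(0));
    remove_batch(vec![("a".into(), 2), ("b".into(), 1)], Some(&counter));
    println!("{}", counter.0.get()); // 3 drops recorded
    // With None, the whole loop is skipped in a single check.
    remove_batch(vec![("c".into(), 1)], None);
}
```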
```rust
}

/// Remove an entire batch.
fn remove(&mut self, batch: BatchId, tag_drop: Option<&dyn TagDrop>) {
```
`leak` is always called on `store`. What happens if `remove` and `remove_one` are called without a `TagDrop`?
closing in favor of #2545
## Description

This is the third attempt to add a batch API for adding blobs. Previous one was #2339

The basic idea is the following: all changes to the store happen in the context of a _batch_. All write operations within a batch produce temp tags. These temp tags are scoped to the batch and keep the data alive as long as the batch exists. At some point, the API user has to upgrade one or more temp tags to permanent tags.

All non-batch operations would long term be implemented in terms of batch operations. In a second step, the following rpc calls would be replaced by their batch equivalent.

- AddStream
- AddPath
- CreateCollection

The third one is very nice, since it means that the notion of a collection (as in a special kind of hashseq) no longer has to even exist in the node code.

## Breaking Changes

- iroh::client::blobs::BlobStatus has a new case NotFound
- iroh::client::blobs::BlobStatus::Partial: size is now a BaoBlobSize instead of a u64

All other public changes are adding of new APIs.

## Notes & open questions

Note: in the previous version I had an optimisation to avoid storing TempTags in the case where there are multiple TempTags with the same hash. I removed this to keep things simple. We can add it back later.

## Change checklist

- [ ] Self-review.
- [ ] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.

---------

Co-authored-by: Philipp Krüger <[email protected]>
## Description

This adds a new API for creating blobs.

You create a batch. Within a batch you get all the usual operations to add stuff, like add_bytes, add_stream, add_from_path etc. Notable differences to the existing API:

The way to use the API is to perform a complex operation within a batch, and then at the end assign a (non-temporary) tag to the root(s) of the created data before dropping the batch.

It is possible to scan a directory and create a collection purely on the client side, so the code to traverse a directory can be removed from the node.

To allow the workflow described above, the tags client has been extended to allow manually setting a tag.

Ideally this API would entirely replace the current blobs API, so all read ops would always happen within the context of a batch.

## Breaking Changes

At this point mostly adding stuff, but changing the RPC API for setting tags. I might decide to remove the entire non-batch mutation API in this PR.

How it works is to leave a streaming RPC call open for each batch, then do operations in the context of a unique identifier for this RPC call.
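The batch-scoped RPC shape described above can be sketched as messages that each carry the id of the open streaming call they belong to. All names here (`BatchId`, `BatchRequest`, `Server`) are hypothetical illustrations, not the actual iroh RPC types.

```rust
use std::collections::HashMap;

// Identifier of one open streaming RPC call.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct BatchId(u64);

// Each write op names the batch it is scoped to.
#[derive(Debug)]
enum BatchRequest {
    AddBytes { batch: BatchId, data: Vec<u8> },
    Drop { batch: BatchId },
}

#[derive(Default)]
struct Server {
    // temp tags per open batch; dropping the batch releases them all
    batches: HashMap<BatchId, Vec<String>>,
}

impl Server {
    fn handle(&mut self, req: BatchRequest) {
        match req {
            BatchRequest::AddBytes { batch, data } => {
                let tag = format!("temp-{}", data.len()); // stand-in hash
                self.batches.entry(batch).or_default().push(tag);
            }
            BatchRequest::Drop { batch } => {
                // closing the streaming call releases every temp tag in scope
                self.batches.remove(&batch);
            }
        }
    }
}

fn main() {
    let mut s = Server::default();
    let b = BatchId(1);
    s.handle(BatchRequest::AddBytes { batch: b, data: vec![1, 2, 3] });
    s.handle(BatchRequest::AddBytes { batch: b, data: vec![4] });
    assert_eq!(s.batches[&b].len(), 2);
    s.handle(BatchRequest::Drop { batch: b });
    assert!(s.batches.is_empty());
}
```

The design point is that temp-tag lifetime is tied to the lifetime of the streaming call, so a crashed or disconnected client cannot leak temp tags on the node.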
## Notes & open questions

Note: if things work out, every add operation refers to a single blob, and the aggregation of many blobs can be driven from the client. That means that a lot of the complexity of the progress events, like ids etc., can be removed from the RPC. This still needs to exist, but can be confined to the client.

## Todo

- [ ] Add back fine-grained progress. Fine-grained progress is not needed for all ops, but definitely for add_file and add_dir. Possibly for add_bytes and add_reader. Not sure if it is OK to have it in all cases despite it typically not being used.
- [ ] Purge all tag setting stuff from the blobs API and the downloader. I would propose that we merge this initially as an addition, and do the stripdown of the other APIs in a subsequent PR. Also, if this is an addition we can do the fine-grained progress for add_dir in a subsequent PR as well.

## Change checklist