Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(iroh)!: Blob batch PR, attempt 3 #2545

Merged
merged 20 commits into from
Aug 15, 2024
Merged

feat(iroh)!: Blob batch PR, attempt 3 #2545

merged 20 commits into from
Aug 15, 2024

Conversation

rklaehn
Copy link
Contributor

@rklaehn rklaehn commented Jul 25, 2024

Description

This is the third attempt to add a batch API for adding blobs. Previous one was #2339

The basic idea is the following: all changes to the store happen in the context of a batch. All write operations within a batch produce temp tags. These temp tags are scoped to the batch and keep the data alive as long as the batch exists.

At some point, the API user has to upgrade one or more temp tags to permanent tags.

All non-batch operations would long term be implemented in terms of batch operations.

In a second step, the following rpc calls would be replaced by their batch equivalent.

  • AddStream
  • AddPath
  • CreateCollection

The third one is very nice, since it means that the notion of a collection (as in a special kind of hashseq) no longer has to even exist in the node code.

Breaking Changes

  • iroh::client::blobs::BlobStatus has a new case NotFound
  • iroh::client::blobs::BlobStatus::Partial: size is now a BaoBlobSize instead of a u64

All other public changes are adding of new APIs.

Notes & open questions

Note: in the previous version I had an optimisation to avoid storing TempTags in the case where there are multiple TempTags with the same hash. I removed this to keep things simple. We can add it back later.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

@rklaehn rklaehn force-pushed the batch-blob-api-3 branch from b9e6766 to a372bd6 Compare July 25, 2024 09:09
@rklaehn rklaehn changed the title Blob batch PR, attempt 3 feat(iroh)!: Blob batch PR, attempt 3 Jul 25, 2024
@rklaehn rklaehn force-pushed the batch-blob-api-3 branch from a372bd6 to ede3c55 Compare July 25, 2024 09:14
@rklaehn rklaehn requested a review from matheus23 July 25, 2024 09:25
@matheus23
Copy link
Contributor

matheus23 commented Jul 25, 2024

I would be curious what you think of the API.

I admit, after looking at it a little more, I'm somewhat surprised by the TempTag API combined with the Batch. I guess I had a more limited API that's more tailored to creating Collections or HashSeqes in mind, something like this:

impl iroh::client::blobs::Client {
  pub async fn batch() -> Result<Batch>;
}

impl iroh::client::blobs::Batch {
  pub async fn push_bytes(&self, name: String, bytes: impl Into<Bytes>) -> Result<()>;
  pub async fn push_file(&self, name: String, path: PathBuf) -> Result<u64>;
  pub async fn push_dir(&self, name: String, root: PathBuf) -> Result<()>;
  // [...]
  pub async fn push_dir_with_opts(
    &self,
    root: PathBuf,
    opts: AddDirOpts,
  ) -> Result<()>;
  // [...]
  pub async fn commit(&self) -> Result<Tag>; // maybe even taking an owned `self`?
  pub async fn commit_to(&self, tag: &Tag) -> Result<()>;
}

So this would commit the batch to be a Collection, and you wouldn't surface the concept of TempTags to the user at all.

But it's also more limited, so a classic complexity vs. expressiveness issue.
Thoughts?

Perhaps it makes sense to investigate some use-cases to see if the added expressiveness is worth it. "Importing the linux kernel" is definitely one.

@rklaehn
Copy link
Contributor Author

rklaehn commented Jul 25, 2024

I would be curious what you think of the API.

I admit, after looking at it a little more, I'm somewhat surprised by the TempTag API combined with the Batch. I guess I had a more limited API that's more tailored to creating Collections or HashSeqes in mind, something like this:

impl iroh::client::blobs::Client {
  pub async fn batch() -> Result<Batch>;
}

impl iroh::client::blobs::Batch {
  pub async fn push_bytes(&self, name: String, bytes: impl Into<Bytes>) -> Result<()>;
  pub async fn push_file(&self, name: String, path: PathBuf) -> Result<u64>;
  pub async fn push_dir(&self, name: String, root: PathBuf) -> Result<()>;
  // [...]
  pub async fn push_dir_with_opts(
    &self,
    root: PathBuf,
    opts: AddDirOpts,
  ) -> Result<()>;
  // [...]
  pub async fn commit(&self) -> Result<Tag>; // maybe even taking an owned `self`?
  pub async fn commit_to(&self, tag: &Tag) -> Result<()>;
}

So this would commit the batch to be a Collection, and you wouldn't surface the concept of TempTags to the user at all.

But it's also more limited, so a classic complexity vs. expressiveness issue. Thoughts?

Perhaps it makes sense to investigate some use-cases to see if the added expressiveness is worth it. "Importing the linux kernel" is definitely one.

(I would prefer the concept of a collection to be just a tiny sliver in the client)

The above proposal assumes that you always want to create exactly 1 collection or HashSeq, but e.g. it is possible that you want to create multiple individually tagged blobs or hashseqs in a batch.

The long term idea is that even interactions with the network that modify the local store will happen in the context of a batch, so you will be able to use temp tags.

Imagine traversing a dag. You get a blob, parse it for links, get another blob, parse it for links, finally find what you have been looking for, then assign a permanent tag. This would be easily expressible using the API using temp tags.

Admittedly the concept of temp tags is even more useful in ipfs where you can build more complex dags from the leafs and then assign a final tag once you are at the root. But I think it is sufficiently useful even in iroh with its 1 level hierarchies.

Ah, another problem with this API is that the creation of the HashSeq is more or less sequential, whereas with TempTags you can just spawn a bunch of tasks that download stuff, then collect into a sequence of TempTags, then create a HashSeq and make it permanent.

@matheus23
Copy link
Contributor

Okay. I'm convinced the additional expressiveness is worth it.
One thing I'd really like to change is the naming of Batch::upgrade, though. Can we switch it to something like make_permanent or commit? Or something else if there are other suggestions.
Upgrade only makes sense once I've read the docs for Tag and TempTag, looked at the Batch::upgrade signature and compared the two items in my head.
And if I'm looking for ways to make my temporary changes permanent, I end up skipping a function name like upgrade in my mind.

@rklaehn
Copy link
Contributor Author

rklaehn commented Aug 12, 2024

Okay. I'm convinced the additional expressiveness is worth it.

Glad I convinced you.

One thing I'd really like to change is the naming of Batch::upgrade, though. Can we switch it to something like make_permanent or commit? Or something else if there are other suggestions.

Agree that upgrade is not great. Maybe persist? I mean, a temp tag is purely in memory while a named tag is persistent in the database.

@matheus23
Copy link
Contributor

Yes to persist in favor of upgrade

# Conflicts:
#	iroh/src/client.rs
#	iroh/src/client/blobs.rs
#	iroh/src/lib.rs
#	iroh/src/node.rs
Copy link

github-actions bot commented Aug 14, 2024

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/2545/docs/iroh/

Last updated: 2024-08-15T09:03:08Z

@rklaehn
Copy link
Contributor Author

rklaehn commented Aug 14, 2024

@rklaehn rklaehn marked this pull request as ready for review August 14, 2024 15:29
Copy link
Contributor

@matheus23 matheus23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All in all I'm pretty happy with this. I'd love to now be able to turn all other non-batch blob APIs into ones that always call sync(), but I guess before we can really do that, we'd have needed to make sure that the docs API also has a batch API without immediate sync()s after every operation.
Anyhow, that's for another PR, but just repeating that that's why I'm excited for this.

As for the review:

  • Lots of small documentation nits/suggestions
  • I'm not really happy with the wait_for_gc function simply sleeping for 50s. We really shouldn't have tests like that. EDIT whoops that's not what it's doing. It's 50 milliseconds

iroh/src/client/blobs/batch.rs Outdated Show resolved Hide resolved
iroh/src/client/blobs/batch.rs Outdated Show resolved Hide resolved
iroh/src/client/blobs/batch.rs Outdated Show resolved Hide resolved
iroh/src/client/blobs/batch.rs Outdated Show resolved Hide resolved
iroh/src/client/blobs/batch.rs Outdated Show resolved Hide resolved
iroh/src/client/blobs/batch.rs Outdated Show resolved Hide resolved
iroh/src/client/blobs/batch.rs Outdated Show resolved Hide resolved
iroh/src/client/blobs/batch.rs Outdated Show resolved Hide resolved
iroh/src/node/rpc.rs Outdated Show resolved Hide resolved
iroh/tests/batch.rs Outdated Show resolved Hide resolved
@rklaehn rklaehn enabled auto-merge August 15, 2024 09:11
@rklaehn rklaehn added this pull request to the merge queue Aug 15, 2024
Merged via the queue into main with commit 9a55122 Aug 15, 2024
27 of 28 checks passed
@dignifiedquire dignifiedquire deleted the batch-blob-api-3 branch August 15, 2024 09:34
@matheus23 matheus23 mentioned this pull request Aug 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants