Netsim error: Expected header frame #2399
First commit we've seen this happen on in CI: f73c506
Here's the link to one of the netsim jobs failing: https://github.com/n0-computer/iroh/actions/runs/9585622113/job/26431903382
So going through the flow once end-to-end:
```rust
let mut blob_read = iroh.blobs().read(hash).await?;
tokio::io::copy(&mut blob_read, &mut tokio::io::stdout()).await?;
```
```rust
// iroh-cli/src/commands/start.rs:100
tokio::select! {
    biased;
    // always abort on ctrl-c
    _ = tokio::signal::ctrl_c(), if run_type != RunType::SingleCommandNoAbort => {
        command_task.abort();
        node.shutdown().await?;
    }
    // abort if the command task finishes (will run forever if not in single-command mode)
    res = &mut command_task => {
        let _ = node.shutdown().await;
        res??;
    }
}
```
See how the shutdown clearly happens after the command task has completed. When the read RPC call is handled, a new task is spawned on the runtime:
```rust
self.inner.rt.spawn_pinned(move || async move {
    if let Err(err) = read_loop(req, db, tx.clone(), RPC_BLOB_GET_CHUNK_SIZE).await {
        tx.send_async(RpcResult::Err(err.into())).await.ok();
    }
});
```
The … Now I'm reading the source code of … However! This is an unbounded channel, so dropping the … But still I can't see why we would start shutdown before the command finishes. So this is maybe something to do better, but likely not the cause of the bug at hand.
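On the unbounded-channel point: the `tx.send_async(...)` call above looks like the flume channel API, so here is a minimal sketch of how such a channel behaves (assuming a flume unbounded channel; this is not iroh code). Sends resolve immediately and never block, and once the receiver is dropped a send fails with an error rather than hanging, which would explain the `.ok()` on the send result in the read loop above.

```rust
// Minimal sketch of unbounded-channel semantics, assuming the `flume` crate
// (its `send_async` API matches the call in the read loop above). Not iroh code.
#[tokio::main]
async fn main() {
    let (tx, rx) = flume::unbounded::<u32>();

    // On an unbounded channel, sends resolve immediately; they never wait
    // for the receiver to make progress.
    tx.send_async(1).await.unwrap();
    tx.send_async(2).await.unwrap();

    // Once every receiver is dropped, a send fails instead of blocking,
    // so a sender that ignores the error (like the `.ok()` above) simply
    // stops delivering chunks.
    drop(rx);
    assert!(tx.send_async(3).await.is_err());
}
```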
From @rklaehn, two log cases: one with the bug occurring, one without.
Without Bug:
It seems like …
When removing the polling of the JoinSet, the issue seems to go away. We probably don't want this, but it is an interesting data point.
I think I found it. The issue is that `fn accept()` in quic-rpc is not cancel-safe. It does two things: first it accepts a stream pair, then it awaits the first message on that stream pair. If this future is dropped after it has accepted a stream pair but before it has received the first message, the stream pair is lost. To the caller this looks like the stream not being answered at all. Here is a log of a failure with extended custom logging in quic-rpc:
Here is how it fails: we are awaiting a new stream pair, then awaiting the first message of the stream pair, then the future gets dropped and recreated, and the stream pair is gone. The client side will see the stream pair dropped and will produce the error (the `Expected header frame` error from the issue title).
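To make the hazard concrete, here is a self-contained sketch of the pattern described above. None of these names come from quic-rpc; `StreamPair`, `accept_two_step`, and the channels are stand-ins for the transport. The shape is the same, though: a future that first accepts a stream pair and then awaits its first message loses the already-accepted pair if it is dropped between the two steps, for example when it keeps being recreated inside a `tokio::select!` loop (like the JoinSet polling mentioned earlier).

```rust
use tokio::sync::mpsc;
use tokio::time::{sleep, Duration};

// Stand-in for an incoming "stream pair": the server still has to read the
// first message (here a String) before it can dispatch the request.
struct StreamPair {
    first_msg: mpsc::Receiver<String>,
}

// Hypothetical two-step accept, mirroring the shape described above.
// NOT cancel-safe: if the future is dropped between step 1 and step 2,
// the already-accepted StreamPair is dropped with it and is lost.
async fn accept_two_step(
    incoming: &mut mpsc::Receiver<StreamPair>,
) -> Option<(StreamPair, String)> {
    let mut pair = incoming.recv().await?; // step 1: accept a stream pair
    let msg = pair.first_msg.recv().await?; // step 2: await its first message
    Some((pair, msg))
}

#[tokio::main]
async fn main() {
    let (incoming_tx, mut incoming_rx) = mpsc::channel::<StreamPair>(8);

    // A client opens a stream pair but only sends its first message later.
    let (msg_tx, msg_rx) = mpsc::channel::<String>(1);
    incoming_tx.send(StreamPair { first_msg: msg_rx }).await.unwrap();

    // Server loop: every iteration recreates the accept future and races it
    // against something else (a timer here, standing in for other select arms).
    for i in 0..3 {
        tokio::select! {
            res = accept_two_step(&mut incoming_rx) => {
                println!("iteration {i}: got first message {:?}", res.map(|(_pair, msg)| msg));
                return;
            }
            _ = sleep(Duration::from_millis(10)) => {
                // The accept future is dropped here. It had already taken the
                // StreamPair out of `incoming_rx`, so that pair is now gone and
                // the client will never get an answer on it.
                println!("iteration {i}: accept future dropped and recreated");
            }
        }
    }

    // The client now sends its first message, but nobody holds the pair any
    // more: from the client's point of view the stream was simply dropped.
    let _ = msg_tx.send("hello".to_string()).await;
    println!("stream pair was lost; client sees its stream dropped");
}
```

Running this, the first iteration's accept future takes the stream pair, pends on the first message, and is then dropped when the timer fires; the client's later message goes nowhere, which is exactly the "stream not answered at all" symptom described above.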
Fixed in #2416.
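For reference, one generic way to make such a two-step accept cancel-safe is to keep the already-accepted stream pair in state that outlives the future, so dropping the in-flight call cannot lose it. Below is a sketch of that pattern, reusing `StreamPair` and `mpsc` from the example above; it is a generic illustration, not necessarily how #2416 fixes it.

```rust
use tokio::sync::mpsc;

// Hypothetical cancel-safe variant of the accept above: the already-accepted
// stream pair lives in the acceptor itself, so dropping an in-flight `accept()`
// future cannot lose it. (Generic pattern, not necessarily what #2416 does.)
struct Acceptor {
    incoming: mpsc::Receiver<StreamPair>,
    // Pair accepted in step 1 whose first message hasn't arrived yet.
    pending: Option<StreamPair>,
}

impl Acceptor {
    async fn accept(&mut self) -> Option<(StreamPair, String)> {
        // Step 1: only take a new pair if we don't already hold one.
        if self.pending.is_none() {
            self.pending = Some(self.incoming.recv().await?);
        }
        // Step 2: await the first message. If the future is dropped here,
        // the pair stays in `self.pending` and the next call resumes with it.
        let msg = self.pending.as_mut()?.first_msg.recv().await?;
        Some((self.pending.take()?, msg))
    }
}
```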
Error source: https://github.com/n0-computer/iroh/blob/main/iroh/src/client/blobs.rs#L801