-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: in poll_send
, return Ok(n)
when we have no available addr, or else quinn will kill the endpoint
#2226
Comments
Noticed this as well while checking quinn's code. Where exactly did you experience this thought? This is true as well for |
poll_send
, returning anything except Ok(n)
results in quinn killing the endpointpoll_send
, returning an Ready(Err(e))
& not Ok(n)
results in quinn killing the endpoint
This has been part of pasha's connection issues. We currently are in a strange state with discovery where (when joining a document) it is possible we attempt to dial before we have any addresses. That will be fixed in #2225 . But it illuminated this bug: When we have no possible addresses, we return an error, and this shuts down the entire I'll take a look at the recv side as well. |
poll_send
, returning an Ready(Err(e))
& not Ok(n)
results in quinn killing the endpointpoll_send
, return Ok(n)
when we have no available addr, or else quinn will kill the endpoint
…2225) ## Description A bug in the discovery flow assumed that if we had a mapped quic address for a `node_id`, then we had at least a relay URL or one direct address available for that node. This meant there were instances in which discovery should have been launched before attempting to dial the node, but was never launched, leading to `no UDP or relay address available for node` errors. Now we check if addresses are available for a `node_id` and launch discovery if we do not have any before we attempt to dial. We also now take into account the "alive" status of any relay URL information we have on a `node_id` when determining if we need to run discovery for that `node_id` ### tests This also refactors the test DNS server and the test Pkarr relay server to use `CleanupDropGuard` for resource cleanup. It also moves the functions that launch those servers into the `iroh-net::test_utils` crate. This ended up being an unnecessary refactor (I ended up writing the test in the `discovery` crate anyway), but I left it in case we need to do other tests that rely on discovery outside of the `discovery` crate. ## Notes & open questions The above issue uncovered a more serious bug: the endpoint currently dies when it attempts to dial a node without any available address information because we return the `no UDP or relay address available for node` error. We should not do this. In that situation, we should let that connection timeout or launch discovery services inside the magicsocket. That bug (#2226) is not fixed in this PR. ## Breaking changes - Created new public struct `RelayUrlInfo` that combines the relay_url and additional information about the state of our connection to the remote node at this relay URL: ``` struct RelayUrlInfo { relay_url: RelayUrl, /// How long since there has been activity on this relay url last_alive: Option<Duration>, /// Latest latency information for this relay url latency: Option<Duration>, } ``` - `NodeInfo.relay_url` (called `ConnectionInfo` outside of `iroh-net`) is now `Option<RelayUrlInfo>`, changed from `Option<RelayUrl>` ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant. --------- Co-authored-by: Kasey <[email protected]>
…r direct addresses in `poll_send` (#2322) ## Description If we have no relay URL or addresses to send transmits for a NodeID in `poll_send`, what do we do? Returning `Polling::Ready(Err(e))` causes the endpoint to error, which causes all connections to fail. If we return `Polling::Pending` (in this case), we have no mechanism for waking the waker once the `poll_send` is returned. Also, even if we wake up and continue to poll, we will attempt to send the same transmits that we know we cannot send. If we return `Polling::Ready(Ok(0))`, we will get into a loop in Quinn that attempts to keep re-sending the same transmits. However, if we report back to Quinn that we *have* sent those transmits (by returning `Polling::Ready(Ok(n))`), then Quinn will move on and attempt to send new transmits. QUIC mechanisms might cause those transmits to be re-sent when we get no ACKs for them, but eventually, the connection will time out. closes #2226 ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant. - [x] All breaking changes documented.
No description provided.
The text was updated successfully, but these errors were encountered: