bug: in `poll_send`, return `Ok(n)` when we have no available addr, or else quinn will kill the endpoint #2226

ramfox · 2024-04-23T20:38:10Z

No description provided.

divagant-martian · 2024-04-23T21:54:01Z

Noticed this as well while checking quinn's code. Where exactly did you experience this thought? This is true as well for poll_recv which is the one that has been giving us more trouble

ramfox · 2024-04-24T13:56:06Z

This has been part of pasha's connection issues. We currently are in a strange state with discovery where (when joining a document) it is possible we attempt to dial before we have any addresses. That will be fixed in #2225 . But it illuminated this bug:

When we have no possible addresses, we return an error, and this shuts down the entire Endpoint

I'll take a look at the recv side as well.

…2225) ## Description A bug in the discovery flow assumed that if we had a mapped quic address for a `node_id`, then we had at least a relay URL or one direct address available for that node. This meant there were instances in which discovery should have been launched before attempting to dial the node, but was never launched, leading to `no UDP or relay address available for node` errors. Now we check if addresses are available for a `node_id` and launch discovery if we do not have any before we attempt to dial. We also now take into account the "alive" status of any relay URL information we have on a `node_id` when determining if we need to run discovery for that `node_id` ### tests This also refactors the test DNS server and the test Pkarr relay server to use `CleanupDropGuard` for resource cleanup. It also moves the functions that launch those servers into the `iroh-net::test_utils` crate. This ended up being an unnecessary refactor (I ended up writing the test in the `discovery` crate anyway), but I left it in case we need to do other tests that rely on discovery outside of the `discovery` crate. ## Notes & open questions The above issue uncovered a more serious bug: the endpoint currently dies when it attempts to dial a node without any available address information because we return the `no UDP or relay address available for node` error. We should not do this. In that situation, we should let that connection timeout or launch discovery services inside the magicsocket. That bug (#2226) is not fixed in this PR. ## Breaking changes - Created new public struct `RelayUrlInfo` that combines the relay_url and additional information about the state of our connection to the remote node at this relay URL: ``` struct RelayUrlInfo { relay_url: RelayUrl, /// How long since there has been activity on this relay url last_alive: Option<Duration>, /// Latest latency information for this relay url latency: Option<Duration>, } ``` - `NodeInfo.relay_url` (called `ConnectionInfo` outside of `iroh-net`) is now `Option<RelayUrlInfo>`, changed from `Option<RelayUrl>` ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant. --------- Co-authored-by: Kasey <[email protected]>

…r direct addresses in `poll_send` (#2322) ## Description If we have no relay URL or addresses to send transmits for a NodeID in `poll_send`, what do we do? Returning `Polling::Ready(Err(e))` causes the endpoint to error, which causes all connections to fail. If we return `Polling::Pending` (in this case), we have no mechanism for waking the waker once the `poll_send` is returned. Also, even if we wake up and continue to poll, we will attempt to send the same transmits that we know we cannot send. If we return `Polling::Ready(Ok(0))`, we will get into a loop in Quinn that attempts to keep re-sending the same transmits. However, if we report back to Quinn that we *have* sent those transmits (by returning `Polling::Ready(Ok(n))`), then Quinn will move on and attempt to send new transmits. QUIC mechanisms might cause those transmits to be re-sent when we get no ACKs for them, but eventually, the connection will time out. closes #2226 ## Change checklist - [x] Self-review. - [x] Documentation updates if relevant. - [x] Tests if relevant. - [x] All breaking changes documented.

github-project-automation bot added this to iroh Apr 23, 2024

github-project-automation bot moved this to 📋 Backlog in iroh Apr 23, 2024

ramfox added this to the v0.15.0 milestone Apr 23, 2024

ramfox added bug Something isn't working c-iroh labels Apr 23, 2024

ramfox changed the title ~~bug: in poll_send, returning anything except Ok(n) results in quinn killing the endpoint~~ bug: in poll_send, returning an Ready(Err(e)) & not Ok(n) results in quinn killing the endpoint Apr 24, 2024

ramfox changed the title ~~bug: in poll_send, returning an Ready(Err(e)) & not Ok(n) results in quinn killing the endpoint~~ bug: in poll_send, return Ok(n) when we have no available addr, or else quinn will kill the endpoint Apr 24, 2024

ramfox mentioned this issue Apr 26, 2024

fix(iroh-net)!: only call quinn_connect if a send addr is available #2225

Merged

3 tasks

ramfox modified the milestones: v0.15.0, v0.16.0 Apr 26, 2024

ramfox self-assigned this May 1, 2024

ramfox modified the milestones: v0.16.0, v0.17.0 May 13, 2024

dignifiedquire removed this from the v0.17.0 milestone May 22, 2024

ramfox mentioned this issue May 23, 2024

fix(iroh-net): return Poll::Read(Ok(n)) when we have no relay URL or direct addresses in poll_send #2322

Merged

4 tasks

ramfox closed this as completed in #2322 Jun 6, 2024

github-project-automation bot moved this from 📋 Backlog to ✅ Done in iroh Jun 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: in `poll_send`, return `Ok(n)` when we have no available addr, or else quinn will kill the endpoint #2226

bug: in `poll_send`, return `Ok(n)` when we have no available addr, or else quinn will kill the endpoint #2226

ramfox commented Apr 23, 2024

divagant-martian commented Apr 23, 2024

ramfox commented Apr 24, 2024 •

edited

Loading

bug: in poll_send, return Ok(n) when we have no available addr, or else quinn will kill the endpoint #2226

bug: in poll_send, return Ok(n) when we have no available addr, or else quinn will kill the endpoint #2226

Comments

ramfox commented Apr 23, 2024

divagant-martian commented Apr 23, 2024

ramfox commented Apr 24, 2024 • edited Loading

bug: in `poll_send`, return `Ok(n)` when we have no available addr, or else quinn will kill the endpoint #2226

bug: in `poll_send`, return `Ok(n)` when we have no available addr, or else quinn will kill the endpoint #2226

ramfox commented Apr 24, 2024 •

edited

Loading