fix(iroh-net)!: only call quinn_connect if a send addr is available #2225

Frando · 2024-04-23T20:24:48Z · edited by ramfox

Description

A bug in the discovery flow assumed that if we had a mapped QUIC address for a node_id, then we also had at least a relay URL or one direct address available for that node.

This meant there were cases in which discovery should have been launched before attempting to dial the node, but never was, leading to `no UDP or relay address available for node` errors.

Now we check if addresses are available for a node_id and launch discovery if we do not have any before we attempt to dial.

We also now take into account the "alive" status of any relay URL information we have on a node_id when determining whether we need to run discovery for that node_id.
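The check described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `AddrState`, `needs_discovery`, and `RELAY_ALIVE_THRESHOLD` are hypothetical names, the 60-second cutoff is an arbitrary assumption, and treating a relay URL that was never seen alive as unusable is also an assumption.

```rust
use std::net::SocketAddr;
use std::time::Duration;

/// Hypothetical cutoff after which relay information counts as stale.
const RELAY_ALIVE_THRESHOLD: Duration = Duration::from_secs(60);

/// Hypothetical summary of what we know about how to reach a node.
#[derive(Debug, Default)]
struct AddrState {
    /// Known direct (UDP) addresses for the node.
    direct_addrs: Vec<SocketAddr>,
    /// Relay URL (a plain string here) and how long ago it was last alive.
    relay: Option<(String, Option<Duration>)>,
}

/// Returns true if discovery should be launched before dialing.
fn needs_discovery(state: &AddrState) -> bool {
    let relay_usable = match &state.relay {
        // A relay that was alive recently is good enough to dial through.
        Some((_, Some(last_alive))) => *last_alive <= RELAY_ALIVE_THRESHOLD,
        // Assumption: no relay, or a relay never seen alive, is not usable.
        _ => false,
    };
    state.direct_addrs.is_empty() && !relay_usable
}
```

With this shape, a node with no direct addresses and only a stale relay entry triggers discovery, while a node with any direct address or a recently alive relay is dialed directly.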

Tests

This also refactors the test DNS server and the test Pkarr relay server to use CleanupDropGuard for resource cleanup. It also moves the functions that launch those servers into the iroh-net::test_utils module. This ended up being an unnecessary refactor (I ended up writing the test in the discovery module anyway), but I left it in case we need other tests that rely on discovery outside of the discovery module.

Notes & open questions

The above issue uncovered a more serious bug: the endpoint currently dies when it attempts to dial a node without any available address information, because we return the `no UDP or relay address available for node` error. We should not do this. In that situation, we should let the connection time out or launch discovery services inside the magicsock. That bug (#2226) is not fixed in this PR.

Breaking changes

Created new public struct RelayUrlInfo that combines the relay_url and additional information about the state of our connection to the remote node at this relay URL:

struct RelayUrlInfo {
    relay_url: RelayUrl,
    /// How long since there has been activity on this relay url
    last_alive: Option<Duration>,
    /// Latest latency information for this relay url
    latency: Option<Duration>,
}

NodeInfo.relay_url is now Option<RelayUrlInfo>, changed from Option<RelayUrl>. (NodeInfo is called ConnectionInfo outside of iroh-net.)
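For callers that only need the plain relay URL, the migration might look like the sketch below. The types here are local stand-ins that mirror the struct shown above so the example compiles on its own; they are not imported from iroh-net, and `plain_relay_url` and `relay_latency` are hypothetical helper names.

```rust
use std::time::Duration;

// Stand-ins for the iroh-net types (assumed shapes, for illustration only).
#[derive(Debug, Clone, PartialEq)]
struct RelayUrl(String);

#[derive(Debug, Clone)]
struct RelayUrlInfo {
    relay_url: RelayUrl,
    /// How long since there has been activity on this relay url
    last_alive: Option<Duration>,
    /// Latest latency information for this relay url
    latency: Option<Duration>,
}

#[derive(Debug, Clone)]
struct NodeInfo {
    relay_url: Option<RelayUrlInfo>,
}

/// Recovers the plain `Option<RelayUrl>` that callers used to read directly.
fn plain_relay_url(info: &NodeInfo) -> Option<RelayUrl> {
    info.relay_url.as_ref().map(|r| r.relay_url.clone())
}

/// Reads one of the newly exposed fields: the relay latency, if known.
fn relay_latency(info: &NodeInfo) -> Option<Duration> {
    info.relay_url.as_ref().and_then(|r| r.latency)
}
```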

Change checklist

Self-review.
Documentation updates if relevant.
Tests if relevant.

ramfox · 2024-04-23T20:40:03Z

We also need to ensure that messages from the discovery service contain non-empty address fields before returning (discovery.rs line 247): check that the message contains addrs before adding it to the endpoint and notifying that we have had at least one message response.
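A rough sketch of that check, using a hypothetical stand-in for the address info carried by a discovery message (`AddrInfo` here is defined locally and `first_usable` is an invented helper, not the code at the referenced line):

```rust
use std::net::SocketAddr;

/// Hypothetical stand-in for the address info in a discovery message.
#[derive(Debug, Clone, Default)]
struct AddrInfo {
    relay_url: Option<String>,
    direct_addresses: Vec<SocketAddr>,
}

impl AddrInfo {
    /// True when the message carries neither a relay URL nor a direct address.
    fn is_empty(&self) -> bool {
        self.relay_url.is_none() && self.direct_addresses.is_empty()
    }
}

/// Returns the first discovery result that carries usable address
/// information, skipping responses whose address fields are empty.
fn first_usable(results: impl IntoIterator<Item = AddrInfo>) -> Option<AddrInfo> {
    results.into_iter().find(|info| !info.is_empty())
}
```

Only a result returned by `first_usable` would then count as "at least one message response"; empty responses are ignored rather than signalling success.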

flub · 2024-04-24T08:35:36Z

Maybe on some level the problem is that NodeAddr is allowed to be completely empty. We could try and change the type so that this is not possible. Of course this also means discovery is a problem, and maybe having discovery explicitly outside of the MagicEndpoint/MagicSock is not a bad option.

OTOH if NodeAddr has a single direct socket address in it that does not work, it would be accepted and the MagicSock would still simply time out. This case could be considered indistinguishable from an empty NodeAddr. This last criticism also applies to what is being tried here: if things align you get an early error, but it doesn't help the general case of the node simply not being reachable when tried.

So I kind of think that instead of adding APIs here to try to figure out if a node is reachable, we should make MagicSock handle it more gracefully, since however complex we make these APIs they'll never work (currently at least, I think).

ramfox · 2024-04-26T17:05:30Z

> Maybe on some level the problem is that NodeAddr is allowed to be completely empty.

But we now allow users to dial with just the node_id, so we need to let them add an "empty" NodeAddr.

I agree that this doesn't cover the entire solution: we should not fail when there are no available addrs inside the magicsock. At worst, the connection should time out, as you mentioned.

But that was a more complicated issue, the solution to which might mean moving Discovery management down into the magicsock, and I wanted to make sure we had a fix for this particular logic bug (which only showed up when trying to join a document ticket) sooner.

Frando marked this pull request as draft April 23, 2024 20:25

ramfox self-assigned this Apr 23, 2024

ramfox added the fix Fixes a bug label Apr 23, 2024

ramfox added this to the v0.15.0 milestone Apr 23, 2024

ramfox mentioned this pull request Apr 24, 2024

bug: in poll_send, return Ok(n) when we have no available addr, or else quinn will kill the endpoint #2226

Closed

ramfox force-pushed the fix/connect-send-addr branch from 81fac77 to 8b483b8 April 25, 2024 04:31

Frando and others added 4 commits April 26, 2024 12:59

fix(iroh-net): only call quinn_connect if a send addr is available

1b3fe7c

move run_dns_and_pkarr_servers to iroh::net::test_utils

397c985

to properly test that we have fixed the discovery bug, we need to run the test dns and pkarr relay servers, so they should be moved to `test_utils` so the code can be shared

write test that ensures we launch discovery if we do not have address…

71f6cca

… info for a given node_id

refactor, remove some magicsock apis and add

d034423

this method handles the logic of getting a mapped quic addr and launching discovery if it is needed

ramfox force-pushed the fix/connect-send-addr branch from a674622 to d034423 April 26, 2024 17:05

ramfox marked this pull request as ready for review April 26, 2024 17:05

ramfox changed the title ~~fix(iroh-net): only call quinn_connect if a send addr is available~~ fix!(iroh-net): only call quinn_connect if a send addr is available Apr 26, 2024

ramfox requested review from flub, divagant-martian and dignifiedquire April 26, 2024 18:45

dignifiedquire changed the title ~~fix!(iroh-net): only call quinn_connect if a send addr is available~~ fix(iroh-net)!: only call quinn_connect if a send addr is available Apr 26, 2024

dignifiedquire approved these changes Apr 26, 2024

ramfox added 2 commits April 26, 2024 16:37

feat!(iroh-net): NodeInfo.relay_url is now a RelayUrlInfo

55cfc91

`RelayUrlInfo` combines the `RelayUrl` with any "aliveness" and latency information we have. This is propagated up into the `ConnectionInfo`.

zero out non-deterministic fields in test

758d0ad

ramfox enabled auto-merge April 26, 2024 20:53

ramfox added this pull request to the merge queue Apr 26, 2024

Merged via the queue into main with commit e913051 Apr 26, 2024
21 checks passed

dignifiedquire deleted the fix/connect-send-addr branch April 26, 2024 21:19
