fix(iroh-net)!: only call quinn_connect if a send addr is available #2225

Merged: 6 commits into main from fix/connect-send-addr on Apr 26, 2024

Frando commented Apr 23, 2024

Description

A bug in the discovery flow assumed that if we had a mapped QUIC address for a node_id, we also had at least a relay URL or one direct address for that node.

As a result, there were cases where discovery should have been launched before dialing the node but never was, leading to "no UDP or relay address available for node" errors.

We now check whether any addresses are available for a node_id and, if there are none, launch discovery before attempting to dial.

We also now take the "alive" status of any relay URL information we have for a node_id into account when deciding whether to run discovery for that node_id.
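The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `KnownAddrs`, `RelayState`, `needs_discovery`, and the `max_relay_age` threshold are all hypothetical names, and treating a relay URL that has never been seen alive as unusable is an assumption.

```rust
use std::net::SocketAddr;
use std::time::Duration;

/// Hypothetical view of the relay information held for a node.
#[derive(Debug)]
pub struct RelayState {
    /// Time since the relay URL last showed activity, if it ever did.
    pub last_alive: Option<Duration>,
}

/// Hypothetical snapshot of the addresses known for a node_id.
#[derive(Debug, Default)]
pub struct KnownAddrs {
    pub relay: Option<RelayState>,
    pub direct_addrs: Vec<SocketAddr>,
}

/// Returns true if discovery should run before dialing.
/// `max_relay_age` is an assumed cutoff for considering a relay URL alive.
pub fn needs_discovery(known: Option<&KnownAddrs>, max_relay_age: Duration) -> bool {
    // Nothing known about the node at all: discovery is required.
    let Some(known) = known else {
        return true;
    };
    // Any direct address gives us something to send to.
    if !known.direct_addrs.is_empty() {
        return false;
    }
    // Otherwise we depend on the relay URL being recently alive.
    match &known.relay {
        Some(RelayState { last_alive: Some(age) }) => *age > max_relay_age,
        _ => true,
    }
}
```

A caller would evaluate `needs_discovery` before dialing and start discovery only when it returns true, so a mapped address alone is no longer taken as proof that the node is reachable.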

tests

This also refactors the test DNS server and the test Pkarr relay server to use CleanupDropGuard for resource cleanup. It also moves the functions that launch those servers into the iroh-net::test_utils crate. This ended up being an unnecessary refactor (I ended up writing the test in the discovery crate anyway), but I left it in case we need to do other tests that rely on discovery outside of the discovery crate.

Notes & open questions

The above issue uncovered a more serious bug: the endpoint currently dies when it attempts to dial a node without any available address information, because we return the "no UDP or relay address available for node" error. We should not do this. In that situation, we should either let the connection time out or launch discovery services inside the magicsocket. That bug (#2226) is not fixed in this PR.

Breaking changes

  • Created new public struct RelayUrlInfo that combines the relay_url and additional information about the state of our connection to the remote node at this relay URL:
struct RelayUrlInfo {
    relay_url: RelayUrl,
    /// How long since there has been activity on this relay url
    last_alive: Option<Duration>,
    /// Latest latency information for this relay url
    latency: Option<Duration>,
}
  • NodeInfo.relay_url (called ConnectionInfo outside of iroh-net) is now Option<RelayUrlInfo>, changed from Option<RelayUrl>
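For callers that previously read the bare relay URL, the migration might look like the sketch below. It is only an illustration: `RelayUrl` here is a stand-in string wrapper rather than the iroh-net type, the `pub` field visibility is assumed, and `bare_relay_url` is a hypothetical helper.

```rust
use std::time::Duration;

/// Stand-in for iroh-net's `RelayUrl`; a plain string wrapper for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RelayUrl(pub String);

/// Mirrors the struct listed above; field visibility is an assumption.
#[derive(Debug, Clone)]
pub struct RelayUrlInfo {
    pub relay_url: RelayUrl,
    /// How long since there has been activity on this relay URL.
    pub last_alive: Option<Duration>,
    /// Latest latency information for this relay URL.
    pub latency: Option<Duration>,
}

/// Hypothetical helper: recover the bare URL that callers used to read directly
/// when the field was `Option<RelayUrl>`.
pub fn bare_relay_url(info: Option<&RelayUrlInfo>) -> Option<&RelayUrl> {
    info.map(|i| &i.relay_url)
}
```

Code that only needs the URL can go through a helper like this, while code that cares about reachability can inspect `last_alive` and `latency` as well.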

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.

@Frando Frando marked this pull request as draft April 23, 2024 20:25
@ramfox ramfox self-assigned this Apr 23, 2024

ramfox commented Apr 23, 2024

We also need to ensure that messages from the discovery service contain non-empty address fields before returning (discovery.rs line 247): check that the message contains addrs before adding it to the endpoint and notifying that we have received at least one response.
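A rough sketch of that check, under assumed types: `DiscoveredAddrs` and `first_usable` are hypothetical names, not the iroh-net discovery API, and the relay URL is modeled as a plain string.

```rust
use std::net::SocketAddr;

/// Hypothetical payload of a single discovery response.
#[derive(Debug, Default)]
pub struct DiscoveredAddrs {
    pub relay_url: Option<String>,
    pub direct_addresses: Vec<SocketAddr>,
}

impl DiscoveredAddrs {
    /// True if the response carries neither a relay URL nor a direct address.
    pub fn is_empty(&self) -> bool {
        self.relay_url.is_none() && self.direct_addresses.is_empty()
    }
}

/// Returns the first response that actually contains an address,
/// skipping responses with empty address fields.
pub fn first_usable(
    items: impl IntoIterator<Item = DiscoveredAddrs>,
) -> Option<DiscoveredAddrs> {
    items.into_iter().find(|item| !item.is_empty())
}
```

With a filter like this, an empty response never counts as "we have heard back from discovery", so the dial is not attempted on the strength of a message that carries no addresses.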

@ramfox ramfox added the fix Fixes a bug label Apr 23, 2024
@ramfox ramfox added this to the v0.15.0 milestone Apr 23, 2024

flub commented Apr 24, 2024

Maybe on some level the problem is that NodeAddr is allowed to be completely empty. We could try and change the type so that this is not possible. Of course this also means discovery is a problem, and maybe having discovery explicitly outside of the MagicEndpoint/MagicSock is not a bad option.

OTOH if NodeAddr has a single direct socket address in it that does not work, it would be accepted and the MagicSock would still simply time out. This case could be considered indistinguishable from an empty NodeAddr. This last criticism also applies to what is being tried here: if things align you get an early error, but it doesn't help the general case of the node simply not being reachable when tried.

So I kind of think that instead of adding APIs here to try to figure out whether a node is reachable, we should make MagicSock handle it more gracefully, since however complex we make these APIs they'll never work (currently at least, I think).

Frando and others added 4 commits April 26, 2024 12:59
to properly test that we have fixed the discovery bug, we need to run the test dns and pkarr relay servers, so they should be moved to `test_utils` so the code can be shared
this method handles the logic of getting a mapped quic addr and launching discovery if it is needed

ramfox commented Apr 26, 2024

> Maybe on some level the problem is that NodeAddr is allowed to be completely empty.

But we now allow for users to dial with just the node_id, so we need to be able to allow for them to add an "empty" node addr.

I agree that this doesn't cover the entire solution: we should not fail when there are no available addrs inside the magicsock. At worst, the connection should time out, as you mentioned.

But that is a more complicated issue, and its solution might mean moving Discovery management down into the magicsock. I wanted to make sure we had a fix for this particular logic bug (which only showed up when trying to join a document ticket) sooner.

@ramfox ramfox force-pushed the fix/connect-send-addr branch from a674622 to d034423 Compare April 26, 2024 17:05
@ramfox ramfox marked this pull request as ready for review April 26, 2024 17:05
@ramfox ramfox changed the title fix(iroh-net): only call quinn_connect if a send addr is available fix!(iroh-net): only call quinn_connect if a send addr is available Apr 26, 2024
@dignifiedquire dignifiedquire changed the title fix!(iroh-net): only call quinn_connect if a send addr is available fix(iroh-net)!: only call quinn_connect if a send addr is available Apr 26, 2024
ramfox added 2 commits April 26, 2024 16:37
`RelayUrlInfo` combines the `RelayUrl` with any "aliveness" and
latency information we have. This is propagated up into the
`ConnectionInfo`.
@ramfox ramfox enabled auto-merge April 26, 2024 20:53
@ramfox ramfox added this pull request to the merge queue Apr 26, 2024
Merged via the queue into main with commit e913051 Apr 26, 2024
21 checks passed
@dignifiedquire dignifiedquire deleted the fix/connect-send-addr branch April 26, 2024 21:19