wip: attempt to call discovery inside the magicsock #2256

ramfox · 2024-05-01T01:57:40Z

Description

The discovery api requires passing in a MagicEndpoint right now, which limits how we can use it in the magicsocket.

We have two options as I can see it:

The discovery mechanism needs to be refactored to house all of it's own buisness.
For example, right now DNS discovery expect access to the endpoint's DNS resolver. If it housed its own dns resolver (or a copy of the endpoint's resolver), than we wouldn't need to worry where we call this from.
Then, the magicsock can be in charge of starting, cancelling, and managing discovery tasks, if we have a discovery service enabled.
MagicEndpoint has an mechanism to drive discovery requests from the magicsock.
This requires fewer changes, and makes it so that the magicsock does not need to care about what it calls for discovery, since the endpoint takes care of it.
I'm going to try start with this, since if we go for option 1 we can just move the same mechanism into the magicsock.

Breaking Changes

Notes & open questions

Closing this PR without merging. During this process I've realized that Discovery should be limited to finding your neighbor's addresses BEFORE we connect, and therefore should NOT be moved down into the magicsock.

Discovery is for finding our neighbor's addresses BEFORE we connect

Discovery should be limited to connection initialization. We should not run discovery once a connection has already been established, and we should rely on our other systems to heal connections when a node has migrated to a new network or has a new network configuration.

This means we should ensure that we cannot / do not call quinn_connect for any NodeAddr that consists only of a NodeID.

If this situation occurs, calling quinn_connect on a NodeAddr with no relay URL or direct addrs, we have very limited ways of communicating this issue back up to the user, unless it is caught before we dial. Once we dial, the Endpoint will log an error, for each packet we attempt to send to a node with no relay URL or direct addrs, and eventually timeout.

How does the decision “Discovery should be limited to connection initialization” affect our current code? First, don't merge this PR.

Since we made magicsock private, we are the only ones who can get ourselves into this state. We need to make sure we never call quinn_connect with an empty node address.

However, in the future, if we ever get more aggressive in pruning bad addresses, we may run into this dilemma again. I suggest that we never prune addresses WHILE a connection is happening and only before we attempt to connect or on shutdown/cleanup. That way, we can launch discovery, if needed, before attempting to make the connection.

Outstanding questions for discovery:

Discovery currently works quickly because our current DNS setup works well. But in theory, especially if someone writes a distributed version of discovery, this can be very slow. In the IPFS paradigm, this also often leads to confusion—am I not connecting because the address I received from Discovery did not work, OR did Discovery still not respond?
Should we be adding timeouts to discovery? I think so.

Change checklist

Self-review.
Documentation updates if relevant.
Tests if relevant.
All breaking changes documented.

… start discovery.

this is not a clean commit

ramfox · 2024-05-23T18:37:40Z

Closing this PR without merging. During this process I've realized that Discovery should be limited to finding your neighbor's addresses BEFORE we connect, and therefore should NOT be moved down into the magicsock.

Discovery is for finding our neighbor's addresses BEFORE we connect
Discovery should be limited to connection initialization. We should not run discovery once a connection has already been established, and we should rely on our other systems to heal connections when a node has migrated to a new network or has a new network configuration.

This means we should ensure that we cannot / do not call quinn_connect for any NodeAddr that consists only of a NodeID.

If this situation occurs, calling quinn_connect on a NodeAddr with no relay URL or direct addrs, we have very limited ways of communicating this issue back up to the user, unless it is caught before we dial. Once we dial, the Endpoint will log an error, for each packet we attempt to send to a node with no relay URL or direct addrs, and eventually timeout.

How does the decision “Discovery should be limited to connection initialization” affect our current code? First, don't merge this PR.

Since we made magicsock private, we are the only ones who can get ourselves into this state. We need to make sure we never call quinn_connect with an empty node address.

However, in the future, if we ever get more aggressive in pruning bad addresses, we may run into this dilemma again. I suggest that we never prune addresses WHILE a connection is happening and only before we attempt to connect or on shutdown/cleanup. That way, we can launch discovery, if needed, before attempting to make the connection.

Outstanding questions for discovery:

Discovery currently works quickly because our current DNS setup works well. But in theory, especially if someone writes a distributed version of discovery, this can be very slow. In the IPFS paradigm, this also often leads to confusion—am I not connecting because the address I received from Discovery did not work, OR did Discovery still not respond?
Should we be adding timeouts to discovery? I think so.

ramfox added the c-iroh label May 1, 2024

ramfox self-assigned this May 1, 2024

ramfox added this to the v0.16.0 milestone May 1, 2024

ramfox added 4 commits May 6, 2024 19:02

wip: attempt to call discovery inside the magicsock

69c5dd5

refactor(iroh-net): use DiscoveryTasks to manage discovery tasks

368e3fa

feat(iroh-net): if no addresses exist for a node when we try to send,…

c7b0e71

… start discovery.

wip: moving discovery down into magicsock

f06dd38

this is not a clean commit

ramfox force-pushed the discovery_experiment branch from d61ed6b to f06dd38 Compare May 6, 2024 23:02

ramfox modified the milestones: v0.16.0, v0.17.0 May 13, 2024

dignifiedquire removed this from the v0.17.0 milestone May 21, 2024

ramfox closed this May 23, 2024

ramfox deleted the discovery_experiment branch May 23, 2024 18:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wip: attempt to call discovery inside the magicsock #2256

wip: attempt to call discovery inside the magicsock #2256

ramfox commented May 1, 2024 •

edited

Loading

ramfox commented May 23, 2024

wip: attempt to call discovery inside the magicsock #2256

wip: attempt to call discovery inside the magicsock #2256

Conversation

ramfox commented May 1, 2024 • edited Loading

Description

Breaking Changes

Notes & open questions

Discovery is for finding our neighbor's addresses BEFORE we connect

Change checklist

ramfox commented May 23, 2024

ramfox commented May 1, 2024 •

edited

Loading