Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip: attempt to call discovery inside the magicsock #2256

Closed
wants to merge 4 commits into from

Conversation

ramfox
Copy link
Contributor

@ramfox ramfox commented May 1, 2024

Description

The discovery api requires passing in a MagicEndpoint right now, which limits how we can use it in the magicsocket.

We have two options as I can see it:

  1. The discovery mechanism needs to be refactored to house all of it's own buisness.
    For example, right now DNS discovery expect access to the endpoint's DNS resolver. If it housed its own dns resolver (or a copy of the endpoint's resolver), than we wouldn't need to worry where we call this from.
    Then, the magicsock can be in charge of starting, cancelling, and managing discovery tasks, if we have a discovery service enabled.

  2. MagicEndpoint has an mechanism to drive discovery requests from the magicsock.
    This requires fewer changes, and makes it so that the magicsock does not need to care about what it calls for discovery, since the endpoint takes care of it.
    I'm going to try start with this, since if we go for option 1 we can just move the same mechanism into the magicsock.

Breaking Changes

Notes & open questions

Closing this PR without merging. During this process I've realized that Discovery should be limited to finding your neighbor's addresses BEFORE we connect, and therefore should NOT be moved down into the magicsock.

Discovery is for finding our neighbor's addresses BEFORE we connect

Discovery should be limited to connection initialization. We should not run discovery once a connection has already been established, and we should rely on our other systems to heal connections when a node has migrated to a new network or has a new network configuration.

This means we should ensure that we cannot / do not call quinn_connect for any NodeAddr that consists only of a NodeID.

If this situation occurs, calling quinn_connect on a NodeAddr with no relay URL or direct addrs, we have very limited ways of communicating this issue back up to the user, unless it is caught before we dial. Once we dial, the Endpoint will log an error, for each packet we attempt to send to a node with no relay URL or direct addrs, and eventually timeout.

How does the decision “Discovery should be limited to connection initialization” affect our current code? First, don't merge this PR.

Since we made magicsock private, we are the only ones who can get ourselves into this state. We need to make sure we never call quinn_connect with an empty node address.

However, in the future, if we ever get more aggressive in pruning bad addresses, we may run into this dilemma again. I suggest that we never prune addresses WHILE a connection is happening and only before we attempt to connect or on shutdown/cleanup. That way, we can launch discovery, if needed, before attempting to make the connection.

Outstanding questions for discovery:

  • Discovery currently works quickly because our current DNS setup works well. But in theory, especially if someone writes a distributed version of discovery, this can be very slow. In the IPFS paradigm, this also often leads to confusion—am I not connecting because the address I received from Discovery did not work, OR did Discovery still not respond?
  • Should we be adding timeouts to discovery? I think so.

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.
  • All breaking changes documented.

@ramfox ramfox added the c-iroh label May 1, 2024
@ramfox ramfox self-assigned this May 1, 2024
@ramfox ramfox added this to the v0.16.0 milestone May 1, 2024
@ramfox ramfox force-pushed the discovery_experiment branch from d61ed6b to f06dd38 Compare May 6, 2024 23:02
@ramfox ramfox modified the milestones: v0.16.0, v0.17.0 May 13, 2024
@dignifiedquire dignifiedquire removed this from the v0.17.0 milestone May 21, 2024
@ramfox
Copy link
Contributor Author

ramfox commented May 23, 2024

Closing this PR without merging. During this process I've realized that Discovery should be limited to finding your neighbor's addresses BEFORE we connect, and therefore should NOT be moved down into the magicsock.

Discovery is for finding our neighbor's addresses BEFORE we connect
Discovery should be limited to connection initialization. We should not run discovery once a connection has already been established, and we should rely on our other systems to heal connections when a node has migrated to a new network or has a new network configuration.

This means we should ensure that we cannot / do not call quinn_connect for any NodeAddr that consists only of a NodeID.

If this situation occurs, calling quinn_connect on a NodeAddr with no relay URL or direct addrs, we have very limited ways of communicating this issue back up to the user, unless it is caught before we dial. Once we dial, the Endpoint will log an error, for each packet we attempt to send to a node with no relay URL or direct addrs, and eventually timeout.

How does the decision “Discovery should be limited to connection initialization” affect our current code? First, don't merge this PR.

Since we made magicsock private, we are the only ones who can get ourselves into this state. We need to make sure we never call quinn_connect with an empty node address.

However, in the future, if we ever get more aggressive in pruning bad addresses, we may run into this dilemma again. I suggest that we never prune addresses WHILE a connection is happening and only before we attempt to connect or on shutdown/cleanup. That way, we can launch discovery, if needed, before attempting to make the connection.

Outstanding questions for discovery:

Discovery currently works quickly because our current DNS setup works well. But in theory, especially if someone writes a distributed version of discovery, this can be very slow. In the IPFS paradigm, this also often leads to confusion—am I not connecting because the address I received from Discovery did not work, OR did Discovery still not respond?
Should we be adding timeouts to discovery? I think so.

@ramfox ramfox closed this May 23, 2024
@ramfox ramfox deleted the discovery_experiment branch May 23, 2024 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants