Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(iroh-net): return Poll::Read(Ok(n)) when we have no relay URL or direct addresses in poll_send #2322

Merged
merged 3 commits into from
Jun 6, 2024

Conversation

ramfox
Copy link
Contributor

@ramfox ramfox commented May 23, 2024

Description

If we have no relay URL or addresses to send transmits for a NodeID in poll_send, what do we do?

Returning Polling::Ready(Err(e)) causes the endpoint to error, which causes all connections to fail.

If we return Polling::Pending (in this case), we have no mechanism for waking the waker once the poll_send is returned. Also, even if we wake up and continue to poll, we will attempt to send the same transmits that we know we cannot send.

If we return Polling::Ready(Ok(0)), we will get into a loop in Quinn that attempts to keep re-sending the same transmits.

However, if we report back to Quinn that we have sent those transmits (by returning Polling::Ready(Ok(n))), then Quinn will move on and attempt to send new transmits. QUIC mechanisms might cause those transmits to be re-sent when we get no ACKs for them, but eventually, the connection will time out.

closes #2226

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.
  • All breaking changes documented.

@ramfox ramfox self-assigned this May 23, 2024
@ramfox ramfox changed the title return Poll::Pending when we have no relay URL or direct addresses … fix(iroh-net): return Poll::Pending when we have no relay URL or direct addresses in poll_send May 23, 2024
@Frando
Copy link
Member

Frando commented May 27, 2024

We won't ever call a waker in that case, so I'm wondering if not maybe Poll::Ready(Ok(0)) makes more sense?

Would have to check the quinn source though to see how this behaves for each variant I guess.

Edit: alternatively, if discovery moved into the MagicSock, we could actually store the waker here and wake once we discovered a relay or direct addr. That's more future thoughts though, let's first fix the current "killing the endpoint" situatio.

@ramfox
Copy link
Contributor Author

ramfox commented Jun 5, 2024

We won't ever call a waker in that case, so I'm wondering if not maybe Poll::Ready(Ok(0)) makes more sense?

Thank you, yes, you are right. But I am confused/concerned about one thing. If we never end up sending those messages and keep returning Ok(0), aren't we just going to keep attempting to send those same messages over and over again & never make any progress? Which may block other messages that are trying to be sent on other connections? Is the answer here to just drop the messages for that nodeID on the floor & report that we sent them?

My more aggressive thought would be that we should keep the error since ending up in a state where we have no address for the given node ID should never happen. And when it does happen, people will file bug reports because an error occurred 😂

Edit: alternatively, if discovery moved into the MagicSock, we could actually store the waker here and wake once we discovered a relay or direct addr. That's more future thoughts though, let's first fix the current "killing the endpoint" situation.

I've actually been converted to "We shouldn't move discovery into MagicSock". My reasoning is two-fold:

  1. The separation of concerns is actually a nice boundary to help us reason about what is "discovery" and what is "connectivity." It also allows us to communicate with the user. If discovery fails or no addresses exist, we can error before we attempt to connect.
  2. If two nodes have previously had a connection and one node changes its addresses, we already have healing mechanisms to communicate those changes using Disco messages and the relay. Also, since we don't prune addresses while the endpoint is running, we should never get into a case where we previously had a connection to a node and then suddenly have no addresses for that node.

I wrote more thoughts about this here: https://www.notion.so/number-zero/Connections-Discovery-and-Migration-9c34bad7dba24f80b20275ac78bdf202

Edits made for grammar

@Frando
Copy link
Member

Frando commented Jun 5, 2024

My more aggressive thought would be that we should keep the error since ending up in a state where we have no address for the given node ID should never happen. And when it does happen, people will file bug reports because an error occurred 😂

We can't unfortunately because this error kills the quinn endpoint, not only the connection.

I will check the quinn source what it does if we return Ready(Ok(0)). If that leads to retrying in a loop, then we should likely report the data as sent even though it isn't - this will then make the connection time out after 10s.

@flub
Copy link
Contributor

flub commented Jun 5, 2024

Also, since we don't prune addresses while the endpoint is running

We do prune unused addresses while running.

@flub
Copy link
Contributor

flub commented Jun 5, 2024

I think I'm going down the "pretend we sent it, but it got lost somewhere in the internet tubes". With pending you indeed never install a waker.

IIUC correctly Quinn quinn has a per-endpoint send buffer (per connection in 0.11), so returning pending will block everything in that buffer, including other connections.

@ramfox
Copy link
Contributor Author

ramfox commented Jun 5, 2024

We can't unfortunately because this error kills the quinn endpoint, not only the connection.

I know 😈

I will check the quinn source what it does if we return Ready(Ok(0)). If that leads to retrying in a loop, then we should likely report the data as sent even though it isn't - this will then make the connection time out after 10s.

In all seriousness though—it does retry in a loop, so then reporting the data as sent is the way to go. TY!!!

@ramfox ramfox marked this pull request as ready for review June 5, 2024 18:40
@ramfox
Copy link
Contributor Author

ramfox commented Jun 5, 2024

We do prune unused addresses while running.

Ahh yes, let me rephrase: we never prune to 0. This only matters if we have zero addrs for a node id.

@ramfox ramfox requested review from Frando, flub and dignifiedquire June 5, 2024 19:35
@ramfox ramfox changed the title fix(iroh-net): return Poll::Pending when we have no relay URL or direct addresses in poll_send fix(iroh-net): return Poll::Read(Ok(n)) when we have no relay URL or direct addresses in poll_send Jun 5, 2024
@ramfox ramfox added this to the v0.18.0 milestone Jun 6, 2024
@ramfox ramfox added this pull request to the merge queue Jun 6, 2024
@ramfox ramfox removed this pull request from the merge queue due to a manual request Jun 6, 2024
@ramfox ramfox added this pull request to the merge queue Jun 6, 2024
Merged via the queue into main with commit b2f0b0e Jun 6, 2024
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

bug: in poll_send, return Ok(n) when we have no available addr, or else quinn will kill the endpoint
3 participants