fix(iroh-net): return `Poll::Ready(Ok(n))` when we have no relay URL or direct addresses in `poll_send` #2322

ramfox · 2024-05-23T20:14:17Z

Description

If `poll_send` has no relay URL or direct addresses on which to send the transmits for a NodeID, what should we do?

Returning `Poll::Ready(Err(e))` causes the endpoint to error, which causes all connections to fail.

If we return `Poll::Pending` in this case, we have no mechanism for waking the waker once `poll_send` has returned. Also, even if we did wake up and continue to poll, we would attempt to send the same transmits that we already know we cannot send.

If we return `Poll::Ready(Ok(0))`, we get into a loop in Quinn that keeps trying to re-send the same transmits.

However, if we report back to Quinn that we have sent those transmits (by returning `Poll::Ready(Ok(n))`), then Quinn will move on and attempt to send new transmits. QUIC mechanisms might cause those transmits to be re-sent when we get no ACKs for them, but eventually the connection will time out.
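The option chosen above can be sketched roughly as follows. This is a minimal illustration, not the actual iroh-net code: `SketchSock`, its counters, and the `has_path` flag are hypothetical names standing in for the real socket state and address lookup, and no datagrams are actually written anywhere.

```rust
use std::io;
use std::task::Poll;

/// Hypothetical stand-in for the socket; it only counts what happened.
#[derive(Default)]
struct SketchSock {
    /// Transmits that had a relay URL or direct address to go out on.
    sent: usize,
    /// Transmits dropped because the node had no known path.
    dropped: usize,
}

impl SketchSock {
    /// Sketch of the decision for one destination node.
    /// `has_path` is an assumption standing in for the address lookup;
    /// `transmit_count` is how many transmits were handed to us.
    fn poll_send(&mut self, has_path: bool, transmit_count: usize) -> Poll<io::Result<usize>> {
        if has_path {
            // The real code would write the datagrams here.
            self.sent += transmit_count;
        } else {
            // No relay URL and no direct address: drop the datagrams
            // instead of returning an error, `Pending`, or `Ok(0)`.
            self.dropped += transmit_count;
        }
        // Either way, report every transmit as sent so the caller moves on.
        Poll::Ready(Ok(transmit_count))
    }
}
```

With this shape the unreachable node looks, from the caller's side, like a path with total packet loss, so the failure stays scoped to that one connection.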

closes #2226

Change checklist

Self-review.
Documentation updates if relevant.
Tests if relevant.
All breaking changes documented.

Frando · 2024-05-27T06:30:43Z

We won't ever call a waker in that case, so I'm wondering whether `Poll::Ready(Ok(0))` might make more sense?

Would have to check the quinn source though to see how this behaves for each variant I guess.

Edit: alternatively, if discovery moved into the MagicSock, we could actually store the waker here and wake once we discovered a relay or direct addr. That's more future thoughts though, let's first fix the current "killing the endpoint" situation.
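For illustration only, the "store the waker and wake on discovery" idea could look something like the sketch below. Everything here is hypothetical: `PendingSend`, `park`, and `on_address_discovered` are made-up names, not MagicSock APIs, and `CountingWaker` exists only so the sketch can observe whether a wake happened.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

/// Test helper: a waker that counts how often it was woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Hypothetical slot holding the waker of a send that had no address.
#[derive(Default)]
struct PendingSend {
    waker: Mutex<Option<Waker>>,
}

impl PendingSend {
    /// Called before returning `Poll::Pending`: remember who to wake.
    fn park(&self, waker: &Waker) {
        *self.waker.lock().unwrap() = Some(waker.clone());
    }

    /// Called by (hypothetical) discovery once a relay URL or direct
    /// address is known; wakes the parked task at most once.
    fn on_address_discovered(&self) {
        if let Some(waker) = self.waker.lock().unwrap().take() {
            waker.wake();
        }
    }
}
```

Without a hook like `on_address_discovered`, nothing would ever call the stored waker, which is the problem with returning `Poll::Pending` today.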

ramfox · 2024-06-05T00:22:40Z

> We won't ever call a waker in that case, so I'm wondering whether `Poll::Ready(Ok(0))` might make more sense?

Thank you, yes, you are right. But I am confused/concerned about one thing. If we never end up sending those messages and keep returning Ok(0), aren't we just going to keep attempting to send those same messages over and over again & never make any progress? Which may block other messages that are trying to be sent on other connections? Is the answer here to just drop the messages for that nodeID on the floor & report that we sent them?

My more aggressive thought would be that we should keep the error since ending up in a state where we have no address for the given node ID should never happen. And when it does happen, people will file bug reports because an error occurred 😂

> Edit: alternatively, if discovery moved into the MagicSock, we could actually store the waker here and wake once we discovered a relay or direct addr. That's more future thoughts though, let's first fix the current "killing the endpoint" situation.

I've actually been converted to "We shouldn't move discovery into MagicSock". My reasoning is two-fold:

1. The separation of concerns is actually a nice boundary to help us reason about what is "discovery" and what is "connectivity." It also allows us to communicate with the user. If discovery fails or no addresses exist, we can error before we attempt to connect.
2. If two nodes have previously had a connection and one node changes its addresses, we already have healing mechanisms to communicate those changes using Disco messages and the relay. Also, since we don't prune addresses while the endpoint is running, we should never get into a case where we previously had a connection to a node and then suddenly have no addresses for that node.

I wrote more thoughts about this here: https://www.notion.so/number-zero/Connections-Discovery-and-Migration-9c34bad7dba24f80b20275ac78bdf202

Edits made for grammar

Frando · 2024-06-05T08:14:06Z

> My more aggressive thought would be that we should keep the error since ending up in a state where we have no address for the given node ID should never happen. And when it does happen, people will file bug reports because an error occurred 😂

We can't unfortunately because this error kills the quinn endpoint, not only the connection.

I will check the quinn source to see what it does if we return `Ready(Ok(0))`. If that leads to retrying in a loop, then we should likely report the data as sent even though it isn't. This will then make the connection time out after 10s.

flub · 2024-06-05T08:37:18Z

> Also, since we don't prune addresses while the endpoint is running

We do prune unused addresses while running.

flub · 2024-06-05T08:58:02Z

I think I'm going down the "pretend we sent it, but it got lost somewhere in the internet tubes" route. With pending you indeed never install a waker.

IIUC, Quinn has a per-endpoint send buffer (per-connection in 0.11), so returning pending will block everything in that buffer, including other connections.

ramfox · 2024-06-05T14:18:51Z

> We can't unfortunately because this error kills the quinn endpoint, not only the connection.

I know 😈

> I will check the quinn source to see what it does if we return `Ready(Ok(0))`. If that leads to retrying in a loop, then we should likely report the data as sent even though it isn't. This will then make the connection time out after 10s.

In all seriousness though—it does retry in a loop, so then reporting the data as sent is the way to go. TY!!!
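To make the retry-loop concern concrete, here is a toy model of a caller that keeps polling until its queue is empty. It is a hypothetical driver written for this sketch, not Quinn's actual send loop: `drive`, the `u32` queue entries, and the `max_iterations` guard (added only so the sketch terminates) are all assumptions.

```rust
use std::collections::VecDeque;
use std::io;
use std::task::Poll;

/// Hypothetical driver, NOT Quinn's real code: removes from the queue
/// however many transmits the socket reports as sent, and returns the
/// number of poll iterations it needed (capped at `max_iterations`).
fn drive(
    queue: &mut VecDeque<u32>,
    mut poll_send: impl FnMut(usize) -> Poll<io::Result<usize>>,
    max_iterations: usize,
) -> usize {
    let mut iterations = 0;
    while !queue.is_empty() && iterations < max_iterations {
        iterations += 1;
        match poll_send(queue.len()) {
            Poll::Ready(Ok(n)) => {
                // Only transmits reported as sent leave the queue.
                for _ in 0..n.min(queue.len()) {
                    queue.pop_front();
                }
            }
            // An error or `Pending` stops this toy driver; the queued
            // transmits stay where they are.
            Poll::Ready(Err(_)) | Poll::Pending => break,
        }
    }
    iterations
}
```

In this model, a socket that always answers `Poll::Ready(Ok(0))` burns every allowed iteration without shrinking the queue, while one that answers `Poll::Ready(Ok(n))` drains it in a single pass.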

ramfox · 2024-06-05T18:42:34Z

> We do prune unused addresses while running.

Ahh yes, let me rephrase: we never prune to 0. This only matters if we have zero addrs for a node id.

ramfox self-assigned this May 23, 2024

ramfox changed the title ~~return Poll::Pending when we have no relay URL or direct addresses …~~ fix(iroh-net): return Poll::Pending when we have no relay URL or direct addresses in poll_send May 23, 2024

ramfox added 2 commits June 5, 2024 13:58

return Poll::Pending when we have no relay URL or direct addresses …

99269e7

…to send on

fix(iroh-net): return Poll::Ready(Ok(n)) when we do not have an add…

cb6d75b

…ress for a node id

ramfox force-pushed the fix/poll_send branch from 9f2e906 to cb6d75b June 5, 2024 18:35

ramfox marked this pull request as ready for review June 5, 2024 18:40

Merge branch 'main' into fix/poll_send

e1b4905

ramfox requested review from Frando, flub and dignifiedquire June 5, 2024 19:35

ramfox changed the title ~~fix(iroh-net): return Poll::Pending when we have no relay URL or direct addresses in poll_send~~ fix(iroh-net): return Poll::Ready(Ok(n)) when we have no relay URL or direct addresses in poll_send Jun 5, 2024

flub approved these changes Jun 6, 2024

Frando approved these changes Jun 6, 2024

ramfox added this to the v0.18.0 milestone Jun 6, 2024

ramfox added this pull request to the merge queue Jun 6, 2024

ramfox removed this pull request from the merge queue due to a manual request Jun 6, 2024

ramfox added this pull request to the merge queue Jun 6, 2024

Merged via the queue into main with commit b2f0b0e Jun 6, 2024
32 checks passed
