
fix(iroh-net)!: improve magicsock's shutdown story #2227

Merged

Conversation

@divagant-martian (Contributor) commented Apr 23, 2024

Description

  • Waits for connections to actually be closed by calling quinn::Endpoint::wait_idle after close.
  • Consumes the MagicEndpoint to express that closing an endpoint is a terminal operation.
  • Adds logging around closing the connections, in case it's deemed necessary.
  • Changes MagicSock::poll_recv to return Poll::Pending on close instead of an error. More info on the logic behind this change below.
  • Changes MagicSock::poll_send to not error if nothing is being sent.

Breaking Changes

MagicEndpoint::close now consumes the endpoint.
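
For illustration, a minimal sketch of what the consuming close can look like; the endpoint/msock field names and the exact argument types are assumptions, not the verbatim iroh-net code:

impl MagicEndpoint {
    /// Closing consumes the endpoint, making shutdown a terminal operation.
    pub async fn close(self, error_code: VarInt, reason: &[u8]) -> Result<()> {
        // Ask quinn to close all connections with the given code and reason.
        self.endpoint.close(error_code, reason);
        // Wait until quinn is actually done with every connection...
        self.endpoint.wait_idle().await;
        // ...and only then shut down the magic socket.
        self.msock.close().await?;
        Ok(())
    }
}

From the caller's side this means ep.close(0u32.into(), b"bye").await moves ep, so a closed endpoint can no longer be used by mistake.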

Notes & open questions

Shutting down the MagicSock when quinn is done with it has proven to be something we can't do reliably.

  • Drop-based solutions (close when the MagicSock is dropped) are unreliable because the socket is held by multiple components in quinn, including spawned detached tasks. There isn't really any way to trigger the dropping of these tasks/components from outside quinn, and it would require quite an effort to change this to happen inside quinn.
  • Close-based solutions were rejected. The objective here was to stop polling the socket once the endpoint was both closed (close called) and all connections gracefully finalized. The reasoning is that quinn both receives and sends frames after close in order to read new connection attempts and gracefully reject them. This is a fair argument on their side, despite it clearly being a limitation for reliably freeing resources from quinn.
  • Taking into account that the socket will be polled after close, both to send and to receive, returning an error from these calls will always produce the quinn::endpoint: I/O error: connection closed log error we have been chasing, even after the quinn::Endpoint has been dropped. Therefore changing poll_recv to return Poll::Pending addresses this annoyance (see the sketch after this list).
  • Note that the graceful shutdown part is actually done by calling quinn::Endpoint::wait_idle, and that the (now averted) log error in quinn doesn't really tell us whether shutdown was graceful or not.
  • Note that the above point creates an API disparity between poll_send and poll_recv: one will error on close, the other will return Pending. I find this OK if MagicSock is simply part of our implementation, but right now it's also part of our public API. I wonder whether MagicSock makes sense to users on its own or if we should remove it from the public API.
  • NOTE that a conversation with quinn's maintainers has started about getting a better API that allows understanding when all resources are freed. But this is playing the long game, and we need to solve this on our end for now.
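
As a sketch of the poll_recv change mentioned above (the signature follows the shape of quinn's AsyncUdpSocket::poll_recv; the is_closed check and the inner socket field are assumptions for illustration):

fn poll_recv(
    &self,
    cx: &mut Context,
    bufs: &mut [io::IoSliceMut<'_>],
    meta: &mut [quinn_udp::RecvMeta],
) -> Poll<io::Result<usize>> {
    if self.is_closed() {
        // quinn keeps polling the socket after close so it can receive (and
        // gracefully reject) new connection attempts. Returning an error here
        // is what produced the `quinn::endpoint: I/O error: connection closed`
        // log line, so after close we simply report Pending instead.
        return Poll::Pending;
    }
    // Normal receive path: delegate to the underlying UDP socket.
    self.inner.poll_recv(cx, bufs, meta)
}

poll_send keeps erroring on close (hence the disparity mentioned above), but it no longer errors when there is nothing to send.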

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.
  • All breaking changes documented.

This was linked to issues on Apr 23, 2024
@flub (Contributor) commented Apr 24, 2024

* Drop-based solutions (close when the `MagicSock` is dropped) are unreliable because the socket is held by multiple components in `quinn`, including spawned detached tasks. There isn't really any way to trigger the dropping of these tasks/components from outside `quinn`, and it would require quite an effort to change this to happen inside `quinn`.

I haven't gone through all the pain of trying to implement this, so I might have this wrong. Could you explain why the dropping of the MagicSock by Quinn is not a suitable signal that it is done with it and it can really be closed? I still don't fully grasp this, apologies. Is this mostly around the difficulty of achieving that drop inside of our MagicEndpoint while the MagicEndpoint itself is not yet dropped?

The most recent comment at quinn-rs/quinn#1828 (comment) also seems to think that dropping of the AsyncUdpSocket is a signal that all resources are released.

Also I think once Quinn drops the AsyncUdpSocket you can also be sure there's no more clone of the socket lingering around in some cloned connection, endpoint or whatever internal tasks.

Review threads (resolved): iroh-net/src/magic_endpoint.rs, iroh-net/src/magicsock.rs
@ramfox added this to the v0.15.0 milestone Apr 24, 2024
@divagant-martian (Contributor, Author) commented Apr 24, 2024

Signalling on drop mainly has two problems:

  1. MagicEndpoint is Clone, which makes for this horrible footgun:
let ep1 = MagicEndpoint::new();
let ep2 = ep1.clone();
// With a drop-based close, this would wait for every clone of the
// endpoint to be dropped, but `ep1` is still alive: this is a deadlock.
ep2.close().await;

Keeping track of clones is not easy, so even in the presence of a big warning to users I find such an API honestly unacceptable. It's not like "function xyz must be called before" or "this should run inside a tokio runtime", which are much more easily verifiable requirements. It would be a "you can only call this after dropping all clones of MagicEndpoint".

  2. Clearly, when the AsyncUdpSocket is dropped… well, it's dropped. The thing is that we can't reliably ensure this has happened even if we drop the quinn::Endpoint, because it's the last thing to go. In fact, take code such as this (where the MagicSock has no changes to poll_*):
let msock = MagicSock::new();
let quinn_ep = quinn::Endpoint::new_with_abstract_socket(msock.clone());
drop(quinn_ep);
msock.close().await; // Here the quinn::Endpoint has been dropped, but more
                     // likely than not the `AsyncUdpSocket` **hasn't** been
                     // dropped yet. So more often than not you will _still_
                     // get that error bubbled up as the quinn error log.

The main problem is for sure the first one, and IMO it makes it simply not worth going down such a convoluted path when a far simpler solution suffices.

@divagant-martian added this pull request to the merge queue Apr 24, 2024
Merged via the queue into n0-computer:main with commit 265e284 Apr 24, 2024
21 checks passed
@flub (Contributor) commented Apr 25, 2024

@divagant-martian thanks for the clear summary on the problems of drop!


Successfully merging this pull request may close these issues.

  • Console shutdown is not clean
  • Graceful shutdown of node
4 participants