-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(transport): various tcp transport races #1095
Conversation
This PR is more or less a rewrite of the TCP transport to fix several race conditions and leaks that can happen during startup / shutdown as well as spurious exceptions leaking out of it * simplify close monitoring since `ChronosStream` already closes the underlying chronos client of a `Connection` * fix dialling cancellation * stop accepting incoming connections before closing existing ones (race) * close in-flight connections during `TcpTransport.stop` * make sure only Transport-related exceptions leak out of it
@@ -81,16 +81,17 @@ proc dialAndUpgrade( | |||
if dialed.dir != dir: | |||
dialed.dir = dir | |||
await transport.upgrade(dialed, peerId) | |||
except CancelledError as exc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate on why this is necessary and why we don't need to call close
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, close should stay (ed6be85) - the point was to not return nil
which makes the dialler try the next address instead of aborting the dialling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we catch only LPError
instead?
CancelledError, LPError], raw: true).} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could, but for that to be safe, we would need to annotate dialAndUpgrade with raises annotations as well, so that changes in upgrade
would be caught by the compiler - this would significantly increase the scope of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code catches and raises CancelledError
, I thought this would be equivalent to only catching LPError
instead of CatchableError
. upgrade
- the proc in the try/except
- only raises those two errors. But thinking more about it, you said we still want to call close
when catching CancelledError
, so my suggestion probably doesn't improve anything. Btw, did you forget to call close
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what you mean now, didn't think about it. So even if the called proc is annotated, it isn't safe to trust it if the caller isn't annotated as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But thinking more about it, all errors in this project should inherit from LPError
, shouldn't they? In theory, it should be fine then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's even more correct as LPError
should represent all the expected errors we thought about and are fine ignoring in this case. All the others shouldn't probably be swallowed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broadly, it's best when every abstraction has its own hierarchy - ie failure to decode a multihash is not related to a transport being closed in any meaningful way so having both derived from lperror doesn't really have any underlying motivation except that they happen to be implemented in the same codebase.
As such, it's usually best if exceptions are mapped to the abstraction layer that they're operating at, and each layer translates the exceptions coming from other layers to their own level - ie "socket closed" becomes "transport closed" becomes "peer disconnected" as it travels through the layers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as noted though, changing exception types is usually a breaking change, so it needs to be done carefully - in this particular case, it needs to be done across the entire transport hierarchy at the same time.
|
||
trace "Listening on", address = ma | ||
try: | ||
await procCall Transport(self).start(supported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're probably trying to set "running" only after all the code above ran without issues, but it seems strange to call a parent method only at the end. It will be confusing to see the start log only after the start happened and the associated logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a weakness of the inheritance / method structure of the code - ie the semantics of running
are such that we should not mark the transport as "running" until it has been fully started (all transports are running), but the way transport in general is implemented is that calling the base method is the way to "set" running.
The previous way of setting running
to true before actually starting the listeners is part of the buggy behavior that led to race conditions.
refactoring this flow would require refactoring all transport implementations which is beyond the scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If nothing important is being done in the parent and we don't plan to change it, perhaps we can stop calling it and set the flag manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that requires refactoring all transports (ws/tor too) - something beyond the scope here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why refactoring? it's stop calling the parent and setting the flags in the child.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because to do it properly, one would remove the code from the base call (or really, change the structure so that the base contains functionality for this that doesn't involve inheritance) - then the other transports need to be updated - notably though, the base method needs refactoring anyway for raises
, so it can be done together then.
it would be confusing to have some implementations call the base and some not
if not self.handles(ma): | ||
trace "Invalid address detected, skipping!", address = ma | ||
continue | ||
trace "Starting TCP transport" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized there's a double log here as the parent start also logs this. I also believe this kind of long should be info, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broadly, we use any logging above trace level to provide a "side channel" when the API is too limited to transport the information to the caller.
the more a library sits at the core infrastructure level of an application, the less it should log because logging is part of the application responsibilities broadly - instead of logging, it's better for libraries to provide API where the application can choose to log certain events or not.
thus, any non-trace log in libp2p is really to be seen as a TODO to provide a better error reporting API to the consumer of libp2p - whether it is "correct" to log depends on use case and libp2p can provide greater flexibility to applications by providing the data needed to write good logs instead of doing the logging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But couldn't the lib do both? The lib user can choose what level to use depending on their needs. Something like described here https://www.baeldung.com/hibernate-logging-levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doing both is usually not needed - ie if an API has a rich error handling system "built in", there's not so much value in logs because the information that you would log is available to the consumer of the library. this is not a hard rule though, only a note that different applications will have different logging needs.
it is of course possible to make everything dynamic and configurable, but such options come at the cost of performance and complexity in usage and implementation
What do you think about splitting this PR into smaller ones, perhaps one for each TCP proc? There are too many changes in the TCP transport file, GitHub doesn't even load them by default. I'd argue it's too risky to make all those modifications at once. If there's an issue, it might be difficult to understand the cause. |
libp2p/transports/tcptransport.nim
Outdated
doAssert self.clients[Direction.In].len == 0 | ||
doAssert self.clients[Direction.Out].len == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to crash the application if this condition isn't true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes - it would indicate a logic faiult in the flow (indeed, the fault that this PR seeks to solve, namely that of lingering connection leaks).
in an ideal world, we'd actually remove clients
entirely - broadly, it's not needed but its removal is a much more significant change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the application has been used so far and this condition isn't always true, i.e., this doesn't put the application in an unusable state. I wonder if it'd be better to just log a warning or even an error, but let it continue running until it's fixed.
raise newTransportClosedError() | ||
|
||
let ta = initTAddress(address).valueOr: | ||
raise (ref TcpTransportError)(msg: "Unsupported address: " & $address) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more typing and you can't set custom fields, so (ref Xxx)
works in more cases when exceptions have rich error information
# end up with a race condition where `stop` returns but not all connections | ||
# have been closed - we drop connections in this case in order not to leak | ||
# them | ||
if self.stopping: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern is quite tricky and it's potentially unreliable to delegate to devs the responsibility to remember about those cases. Is there a way to automate it? Something happens when stop
is called that cancels all pending awaited futures and the necessary cleanups happen in an except
block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add those futures in a seq in the transport when they are created, and make them remove themselves when they complete. When we stop the transport we cancel all the futures in the seq.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broadly, when we transition to 2.0 and the new gc, it will be possible to make this more ergonomic - until then, it's simply a pattern to learn and repeat
# The connection had errors / was closed before `await` returned control | ||
await transp.closeWait() | ||
debug "Cannot read remote address", exc = exc.msg | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we start returning nil
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was always returning nil, by not returning anything (implicitly)
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1095 +/- ##
==========================================
- Coverage 84.53% 84.51% -0.03%
==========================================
Files 91 91
Lines 15517 15534 +17
==========================================
+ Hits 13118 13129 +11
- Misses 2399 2405 +6
|
This is flaky on mac https://github.com/vacp2p/nim-libp2p/actions/runs/9029804098/job/24812857646?pr=1095. It says an unhandled error was raised by |
no, but it would be good if this was fixed. |
It has been fixed here #1123 |
This PR is more or less a rewrite of the TCP transport to fix several race conditions and leaks that can happen during startup / shutdown as well as spurious exceptions leaking out of it
ChronosStream
already closes the underlying chronos client of aConnection
TcpTransport.stop