fix(transport): various tcp transport races #1095

arnetheduck · 2024-05-07T13:43:27Z

This PR is more or less a rewrite of the TCP transport to fix several race conditions and leaks that can happen during startup / shutdown as well as spurious exceptions leaking out of it

simplify close monitoring since ChronosStream already closes the underlying chronos client of a Connection
fix dialling cancellation
stop accepting incoming connections before closing existing ones (race)
close in-flight connections during TcpTransport.stop
make sure only Transport-related exceptions leak out of it

This PR is more or less a rewrite of the TCP transport to fix several race conditions and leaks that can happen during startup / shutdown as well as spurious exceptions leaking out of it * simplify close monitoring since `ChronosStream` already closes the underlying chronos client of a `Connection` * fix dialling cancellation * stop accepting incoming connections before closing existing ones (race) * close in-flight connections during `TcpTransport.stop` * make sure only Transport-related exceptions leak out of it

diegomrsantos · 2024-05-07T18:26:09Z

libp2p/dialer.nim

@@ -81,16 +81,17 @@ proc dialAndUpgrade(
          if dialed.dir != dir:
            dialed.dir = dir
          await transport.upgrade(dialed, peerId)
+        except CancelledError as exc:


Could you please elaborate on why this is necessary and why we don't need to call close?

oops, close should stay (ed6be85) - the point was to not return nil which makes the dialler try the next address instead of aborting the dialling

Could we catch only LPError instead?

nim-libp2p/libp2p/transports/transport.nim

Line 88 in c5db35d

CancelledError, LPError], raw: true).} =

we could, but for that to be safe, we would need to annotate dialAndUpgrade with raises annotations as well, so that changes in upgrade would be caught by the compiler - this would significantly increase the scope of this PR

The code catches and raises CancelledError, I thought this would be equivalent to only catching LPError instead of CatchableError. upgrade - the proc in the try/except - only raises those two errors. But thinking more about it, you said we still want to call closewhen catching CancelledError, so my suggestion probably doesn't improve anything. Btw, did you forget to call close?

Ah, I see what you mean now, didn't think about it. So even if the called proc is annotated, it isn't safe to trust it if the caller isn't annotated as well.

But thinking more about it, all errors in this project should inherit from LPError, shouldn't they? In theory, it should be fine then.

Maybe it's even more correct as LPError should represent all the expected errors we thought about and are fine ignoring in this case. All the others shouldn't probably be swallowed here.

Broadly, it's best when every abstraction has its own hierarchy - ie failure to decode a multihash is not related to a transport being closed in any meaningful way so having both derived from lperror doesn't really have any underlying motivation except that they happen to be implemented in the same codebase.

As such, it's usually best if exceptions are mapped to the abstraction layer that they're operating at, and each layer translates the exceptions coming from other layers to their own level - ie "socket closed" becomes "transport closed" becomes "peer disconnected" as it travels through the layers.

as noted though, changing exception types is usually a breaking change, so it needs to be done carefully - in this particular case, it needs to be done across the entire transport hierarchy at the same time.

libp2p/switch.nim

libp2p/transports/tcptransport.nim

diegomrsantos · 2024-05-08T15:01:57Z

libp2p/transports/tcptransport.nim


-    trace "Listening on", address = ma
+    try:
+      await procCall Transport(self).start(supported)


You're probably trying to set "running" only after all the code above ran without issues, but it seems strange to call a parent method only at the end. It will be confusing to see the start log only after the start happened and the associated logs.

this is a weakness of the inheritance / method structure of the code - ie the semantics of running are such that we should not mark the transport as "running" until it has been fully started (all transports are running), but the way transport in general is implemented is that calling the base method is the way to "set" running.

The previous way of setting running to true before actually starting the listeners is part of the buggy behavior that led to race conditions.

refactoring this flow would require refactoring all transport implementations which is beyond the scope of this PR.

If nothing important is being done in the parent and we don't plan to change it, perhaps we can stop calling it and set the flag manually.

that requires refactoring all transports (ws/tor too) - something beyond the scope here

why refactoring? it's stop calling the parent and setting the flags in the child.

because to do it properly, one would remove the code from the base call (or really, change the structure so that the base contains functionality for this that doesn't involve inheritance) - then the other transports need to be updated - notably though, the base method needs refactoring anyway for raises, so it can be done together then.

it would be confusing to have some implementations call the base and some not

diegomrsantos · 2024-05-08T15:06:35Z

libp2p/transports/tcptransport.nim

-    if not self.handles(ma):
-      trace "Invalid address detected, skipping!", address = ma
-      continue
+    trace "Starting TCP transport"


I just realized there's a double log here as the parent start also logs this. I also believe this kind of long should be info, wdyt?

broadly, we use any logging above trace level to provide a "side channel" when the API is too limited to transport the information to the caller.

the more a library sits at the core infrastructure level of an application, the less it should log because logging is part of the application responsibilities broadly - instead of logging, it's better for libraries to provide API where the application can choose to log certain events or not.

thus, any non-trace log in libp2p is really to be seen as a TODO to provide a better error reporting API to the consumer of libp2p - whether it is "correct" to log depends on use case and libp2p can provide greater flexibility to applications by providing the data needed to write good logs instead of doing the logging

But couldn't the lib do both? The lib user can choose what level to use depending on their needs. Something like described here https://www.baeldung.com/hibernate-logging-levels.

doing both is usually not needed - ie if an API has a rich error handling system "built in", there's not so much value in logs because the information that you would log is available to the consumer of the library. this is not a hard rule though, only a note that different applications will have different logging needs.

it is of course possible to make everything dynamic and configurable, but such options come at the cost of performance and complexity in usage and implementation

libp2p/transports/tcptransport.nim

diegomrsantos · 2024-05-08T16:12:00Z

What do you think about splitting this PR into smaller ones, perhaps one for each TCP proc? There are too many changes in the TCP transport file, GitHub doesn't even load them by default. I'd argue it's too risky to make all those modifications at once. If there's an issue, it might be difficult to understand the cause.

diegomrsantos · 2024-05-08T19:12:33Z

libp2p/transports/tcptransport.nim

+      doAssert self.clients[Direction.In].len == 0
+      doAssert self.clients[Direction.Out].len == 0


Do we want to crash the application if this condition isn't true?

yes - it would indicate a logic faiult in the flow (indeed, the fault that this PR seeks to solve, namely that of lingering connection leaks).

in an ideal world, we'd actually remove clients entirely - broadly, it's not needed but its removal is a much more significant change

But the application has been used so far and this condition isn't always true, i.e., this doesn't put the application in an unusable state. I wonder if it'd be better to just log a warning or even an error, but let it continue running until it's fixed.

libp2p/transports/tcptransport.nim

diegomrsantos · 2024-05-09T12:02:41Z

libp2p/transports/tcptransport.nim

+      raise newTransportClosedError()
+
+    let ta = initTAddress(address).valueOr:
+      raise (ref TcpTransportError)(msg: "Unsupported address: " & $address)


why not https://nim-lang.org/docs/system.html#newException.t%2Ctypedesc%2Cstring%2Cref.Exception ?

more typing and you can't set custom fields, so (ref Xxx) works in more cases when exceptions have rich error information

diegomrsantos · 2024-05-09T12:20:14Z

libp2p/transports/tcptransport.nim

+    # end up with a race condition where `stop` returns but not all connections
+    # have been closed - we drop connections in this case in order not to leak
+    # them
+    if self.stopping:


This pattern is quite tricky and it's potentially unreliable to delegate to devs the responsibility to remember about those cases. Is there a way to automate it? Something happens when stop is called that cancels all pending awaited futures and the necessary cleanups happen in an except block.

We could add those futures in a seq in the transport when they are created, and make them remove themselves when they complete. When we stop the transport we cancel all the futures in the seq.

broadly, when we transition to 2.0 and the new gc, it will be possible to make this more ergonomic - until then, it's simply a pattern to learn and repeat

diegomrsantos · 2024-05-09T14:32:11Z

libp2p/transports/tcptransport.nim

+        # The connection had errors / was closed before `await` returned control
+        await transp.closeWait()
+        debug "Cannot read remote address", exc = exc.msg
+        return nil


why do we start returning nil here?

it was always returning nil, by not returning anything (implicitly)

codecov-commenter · 2024-05-09T15:29:28Z

Codecov Report

Attention: Patch coverage is 84.69945% with 28 lines in your changes are missing coverage. Please review.

Project coverage is 84.51%. Comparing base (02c96fc) to head (1c65f04).
Report is 4 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1095      +/-   ##
==========================================
- Coverage   84.53%   84.51%   -0.03%     
==========================================
  Files          91       91              
  Lines       15517    15534      +17     
==========================================
+ Hits        13118    13129      +11     
- Misses       2399     2405       +6

Files	Coverage Δ
libp2p/dialer.nim	`91.56% <100.00%> (+0.84%)`	⬆️
libp2p/errors.nim	`66.66% <ø> (ø)`
libp2p/switch.nim	`90.52% <100.00%> (-1.19%)`	⬇️
libp2p/transports/tortransport.nim	`83.23% <100.00%> (-0.10%)`	⬇️
libp2p/transports/transport.nim	`92.30% <100.00%> (+3.84%)`	⬆️
libp2p/wire.nim	`62.66% <66.66%> (+0.76%)`	⬆️
libp2p/transports/tcptransport.nim	`85.27% <84.02%> (+4.70%)`	⬆️

... and 7 files with indirect coverage changes

…cleanup

diegomrsantos · 2024-05-10T11:02:17Z

This is flaky on mac https://github.com/vacp2p/nim-libp2p/actions/runs/9029804098/job/24812857646?pr=1095.

It says an unhandled error was raised by testdaemon, but all tests passed. Do you know the proper way to fix that?

arnetheduck · 2024-05-11T16:26:47Z

It says an unhandled error was raised by testdaemon, but all tests passed. Do you know the proper way to fix that?

no, but it would be good if this was fixed.

diegomrsantos · 2024-06-25T11:39:13Z

It says an unhandled error was raised by testdaemon, but all tests passed. Do you know the proper way to fix that?

no, but it would be good if this was fixed.

It has been fixed here #1123

arnetheduck added 2 commits May 7, 2024 15:42

compat

ed6be85

diegomrsantos reviewed May 7, 2024

View reviewed changes

missing close

047019a

diegomrsantos reviewed May 8, 2024

View reviewed changes

libp2p/switch.nim Show resolved Hide resolved

diegomrsantos reviewed May 8, 2024

View reviewed changes

libp2p/transports/tcptransport.nim Show resolved Hide resolved