[TEST][NO-MERGE] Stress test named pipes #365

Draft
wants to merge 6 commits into master from bell-db/pipe-stress-test

Conversation

@bell-db (Contributor) commented on Aug 9, 2024

What changes are proposed in this pull request?

We are observing that the named-pipe-based communication in PosixPluginFrontend can get stuck on macOS with a <1% chance. This can be confirmed by running the new test added here 10,000 times; on macOS it is very likely to get stuck at some point:

sbt "testOnly protocbridge.frontend.PosixPluginFrontendSpec"

Upon further logging and debugging, it appears that the handler Future thread always finishes successfully:

PluginFrontend.runWithInputStream...
readInputStreamToByteArrayWithEnv...
runWithBytes...
fsin.close...
Files.newOutputStream...
fsout.write...
fsout.close...
blocking done.

$ jstack $PID
<No scala-execution-context-global threads>

However, the custom plugin script is still waiting for output:

$ ps aux
ec2-user         74676   0.0  0.0 410601216   1232   ??  S     5:28AM   0:00.00 cat /var/folders/p5/lml4x2f142v4qd1c1l5z_vlm0000gn/T/protopipe-1283543240596163707/output
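To confirm which process still holds the pipe open, a standard diagnostic (not part of this PR) is to point lsof at the pipe path from the ps output above:

# Lists every process that has the named pipe open; in the stuck state
# this shows the plugin script's cat reader still blocked on it.
lsof /var/folders/p5/lml4x2f142v4qd1c1l5z_vlm0000gn/T/protopipe-1283543240596163707/output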

Thus the culprit appears to be macOS named pipes themselves. Indeed, based on the findings below, we can conclude that

Named pipes on macOS are not reliable

Issues in macOS Named Pipes

Found Issue 1: Entire <=8192-byte message from dd is always lost

Note: While this issue doesn't seem to be the one causing the protoc-bridge stuck issue (that issue was found on messages >8192 bytes and reproduced on both short and long messages), it does show that named pipes on macOS have deterministically strange behavior.

The most astounding finding is that in some cases, such as when dd writes <=8192 bytes and the reader starts after a delay, named pipes on macOS deterministically throw away the message:

mkfifo "$PIPE_PATH"

# This will immediately finish instead of waiting for a reader:
(dd if=/dev/urandom of="$PIPE_PATH" bs=1 count=8192 2>/dev/null && echo "Completed dumping random bytes to the pipe") &

sleep 1

# The bytes went nowhere, and a reader will be stuck without reading any of the previous bytes:
(cat "$PIPE_PATH" | wc -c && echo "Completed consuming random bytes from the pipe") &
# or
(dd if="$PIPE_PATH" 2>/dev/null && echo "Completed consuming random bytes from the pipe") &

Or, using the stress-test script added in this PR:

$ bash pipe_stress_test.sh 8192 dd cat
$ bash pipe_stress_test.sh 8192 dd dd

With some further experiments, it turns out

  • 8192 is the stream buffer size (sudo sysctl net.local.stream.sendspace). 8193 bytes works fine: the writer blocks until the reader starts, and then the reader and the writer finish at the same time. Starting the reader around or before the writer also works fine.
  • If we manually send an EOF (see the sketch after this list), the reader finishes but doesn't read any of those 8192 bytes, so the buffered data is not merely delayed but lost.
  • Writers like terminal > redirection, cat, and a custom Python script dd.py work fine with less than 8192 bytes, per pipe_stress_test.sh 8192 cat/dd.py dd. This might be due to differences in flushing/caching, but in any case, all of them should send EOF.
  • The same command is broken on Intel and M1 macOS laptops as well as on a macOS EC2 instance. It works fine on Linux, as expected.
  • There are no interfering readers recorded in sudo fs_usage. The pipe seems broken on its own.
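As referenced in the list above, manually sending an EOF means opening the pipe for writing and immediately closing it, the same exec trick used in the Issue 2 investigation below (PIPE_PATH is the placeholder from the earlier snippet):

# Open the pipe for writing on fd 3 without writing any data, then close it.
# Closing the last writer delivers an EOF to the blocked reader.
exec 3>"$PIPE_PATH"
exec 3>&-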

Found Issue 2: EOF occasionally lost on other messages

Note: This is likely what is causing the protoc-bridge stuck issue here.

For messages >8192 bytes or shorter ones sent by cat/dd.py, the pipe can throw away EOF non-deterministically (<0.1% chance), especially when the reader is started after a delay. It's less likely but can still occur when the reader is immediately started without a delay (probably <0.01% chance).

$ bash pipe_stress_test.sh 10000 dd dd
$ bash pipe_stress_test.sh 100 cat cat
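The actual script is part of this PR; the following is only a hypothetical approximation of one iteration, to make the writer/delay/reader structure concrete (all names and values are illustrative):

# One illustrative iteration: random payload, dd writer, delayed cat reader.
PIPE_PATH="$(mktemp -d)/output"
mkfifo "$PIPE_PATH"
LEN=$(( RANDOM + 8193 ))                      # >8192 bytes, per Issue 2
head -c "$LEN" /dev/urandom > /tmp/payload
(dd if=/tmp/payload of="$PIPE_PATH" 2>/dev/null \
  && echo "Completed dumping random bytes to the pipe") &
sleep 0.028                                   # random millisecond-scale delay in the real script
BYTES=$(cat "$PIPE_PATH" | wc -c)             # hangs here when the EOF is lost
echo "Successfully read $BYTES bytes from the pipe"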

For example, after 1778 iterations of dd into a pipe and a delayed cat from the pipe on an M1 EC2 instance, the pipe got stuck.

Iteration: 1778
Started monitoring the pipe /var/folders/p5/lml4x2f142v4qd1c1l5z_vlm0000gn/T/protopipe.x91I5Bmgpe/output (PID: 2393)
Random byte length: 67585 bytes
Started dumping random bytes to the pipe (PID: 2394)
Sleeping for: 28 milliseconds
Started consuming data from the pipe (PID: 2401)
Completed dumping random bytes to the pipe
The dumping process has stopped (PID: 2394)
<stuck>

In this case, if we manually send an EOF to the pipe, we can tell the previous data were read correctly; only the EOF was missing for some reason:

$ exec 3>/var/folders/p5/lml4x2f142v4qd1c1l5z_vlm0000gn/T/protopipe.x91I5Bmgpe/output
$ exec 3>&-
Completed consuming random bytes from the pipe
The consuming process has stopped (PID: 2401)
Stopped monitoring the pipe (PID: 2393)
Successfully read 67585 bytes from the pipe
Iteration: 1779

This can be reproduced

  • After further thousands of iterations
  • On both macOS laptops and macOS EC2 instances
  • With both the dd and cat writers. So far, it hasn't been reproduced with the dd.py writer, but that could just be a timing difference
  • With dd, cat, and dd.py readers

@bell-db bell-db changed the title [NO-MERGE] Stress test named pipes [TEST][NO-MERGE] Stress test named pipes Aug 9, 2024
@bell-db bell-db force-pushed the bell-db/pipe-stress-test branch from bb9c820 to eea0ae0 on August 9, 2024 20:24
@bell-db bell-db marked this pull request as draft September 12, 2024 19:53