[TEST][NO-MERGE] Stress test named pipes #365
Draft
What changes are proposed in this pull request?
We are observing that the named pipe based communication in PosixPluginFrontend can get stuck on macOS with <1% chance. This can be confirmed by running the new test added here (running 10,000 times). The test is very likely to get stuck on macOS.
Upon further logging and debugging, it appears that the handler Future thread always finishes successfully:
However, the custom plugin script is still waiting for output:
Thus the culprit appears to be with macOS named pipes. And indeed, based on the findings below, we can conclude that
Issues in macOS Named Pipes
Found Issue 1: Entire <=8192-byte message from dd always lost
Note: While this issue doesn't seem to be the one causing the protoc-bridge stuck issue (that issue was found on messages >8192 bytes and reproduced on both short and long messages), it does show that named pipes on macOS have deterministically strange behaviors.
The most astounding finding is that in some cases, such as when dd writes <=8192 bytes and the reader is started after a delay, named pipes on macOS deterministically throw away messages:
Or
With some further experiments, it turns out sudo sysctl net.local.stream.sendspace.
8193 bytes will work fine: the writer blocks until the reader starts, and then the reader and the writer finish at the same time. Starting the reader around or before the writer will also work fine.
Writers like cat and a custom Python script dd.py work fine with less than 8192 bytes, according to pipe_stress_test.sh 8192 cat/dd.py dd. It might be due to a difference in flushing/caching, but in any case, all of them should send EOF.
sudo fs_usage. The pipe seems broken on its own.
Found Issue 2: EOF lost occasionally on other messages
Note: This is likely what is causing the protoc-bridge stuck issue here.
For messages >8192 bytes, or shorter ones sent by cat/dd.py, the pipe can throw away EOF non-deterministically (<0.1% chance), especially when the reader is started after a delay. It's less likely but can still occur when the reader is started immediately without a delay (probably <0.01% chance).
For example, after 1778 iterations of dd into a pipe and a delayed cat from the pipe on an M1 EC2 instance, the pipe got stuck.
In this case, if we manually send EOF to the pipe, we can tell the previous data were correctly read, just missing an EOF for some reason:
This can be reproduced with dd and cat writers. So far, it hasn't been reproduced with the dd.py writer, but that could just be a timing difference.
It can be reproduced with dd, cat, and dd.py readers.
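For readers who want to try the experiments above, here is a minimal Python sketch of the write-then-delayed-read loop. It is not the pipe_stress_test.sh script from this PR. The function names run_once and stress, the timeout value, and the use of /dev/zero as input are assumptions made for illustration. It assumes dd and cat are on PATH and treats a reader still blocked after the timeout as the "stuck" case.

```python
import os
import subprocess
import tempfile
import time


def run_once(size, delay, timeout=5.0):
    """Write `size` bytes into a fresh FIFO with dd, then start cat after `delay` seconds.

    Returns the number of bytes the reader received, or None if the reader
    was still blocked after `timeout` seconds (hypothetical "stuck" signal).
    """
    with tempfile.TemporaryDirectory() as tmp:
        fifo = os.path.join(tmp, "pipe")
        os.mkfifo(fifo)
        # Writer: a single write of `size` bytes, after which dd closes the pipe (EOF).
        writer = subprocess.Popen(
            ["dd", "if=/dev/zero", f"of={fifo}", f"bs={size}", "count=1"],
            stderr=subprocess.DEVNULL,
        )
        time.sleep(delay)  # start the reader late, as in the experiments above
        reader = subprocess.Popen(["cat", fifo], stdout=subprocess.PIPE)
        try:
            out, _ = reader.communicate(timeout=timeout)
            writer.wait(timeout=timeout)
            return len(out)
        except subprocess.TimeoutExpired:
            # Either side is stuck: kill both and report the iteration as stuck.
            reader.kill()
            writer.kill()
            reader.communicate()
            writer.wait()
            return None


def stress(iterations, size, delay):
    """Repeat run_once; return (iteration, bytes_read) for the first failure, else None."""
    for i in range(1, iterations + 1):
        got = run_once(size, delay)
        if got != size:
            return i, got
    return None


if __name__ == "__main__":
    # Message sizes on either side of the 8192-byte boundary discussed above.
    for size in (8192, 8193):
        print(size, stress(iterations=100, size=size, delay=0.1))
```

A result of None from stress means every iteration delivered the full message and EOF. A tuple gives the first failing iteration and the byte count the reader saw, where a count of None means the reader never saw EOF within the timeout.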