-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
traffic-resilience-http: Fix flaky testStopAcceptingConnections() test #3125
traffic-resilience-http: Fix flaky testStopAcceptingConnections() test #3125
Conversation
The initial commits all ran the offending test in a loop to surface the problem. The last few commits with ✅ runs had some form of the test fix and successfully repeated the test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one!
...test/java/io/servicetalk/traffic/resilience/http/TrafficResilienceHttpServiceFilterTest.java
Outdated
Show resolved
Hide resolved
...test/java/io/servicetalk/traffic/resilience/http/TrafficResilienceHttpServiceFilterTest.java
Outdated
Show resolved
Hide resolved
...test/java/io/servicetalk/traffic/resilience/http/TrafficResilienceHttpServiceFilterTest.java
Show resolved
Hide resolved
// We expect up to a couple connections to succeed due to the intrinsic race between disabling accepts | ||
// and new connect requests, as well as to account for kernel connect backlog. However, we do expect it | ||
// to fail after a fairly short number of iterations. | ||
for (int i = 0; i < 5; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5
-> should this be proportional to TCP_BACKLOG
? I read the commit message description but I'm assuming the test can sequence such that "we stop accepting connections" and "try to make a new connection" so the kernel should only accept TCP_BACKLOG [+ 1]
connections before queues fill up bcz server isn't doing accept
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't enforce a happens-before wrt the 'stop accepting' and 'make a new one'. I've lowered the value to three, since that appears to be the upper bound for when we stop accepting connections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does 3 come from? Can you clarify the mechanics and who is "we" in this statement when we stop accepting connections.
? I'm assuming there should be some relationship to TCP_BACKLOG
(how many connection kernel will syn/ack before the application calls accept)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 comes from the existing behavior and justifications of the tests:
1 - satisfying a pending accept
2 - backlog of 1 for both linux and darwin (although params have different interpretations)
3 - This one should fail.
It happens that 2 also fails from time to time with the behavior that we expect from 3, and that is what results in the flakyness.
The "we" is the listener loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would TCP_BACKLOG + 2
make sense with comments similar to what you have above (potentially considering the off-by-one difference for linux)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should base it on TCP_BACKLOG
because that symbol has a different purpose than you'd expect (and thus it's big associated comment above the definition). The TLDR is that the backlog is effectively 1 across the board, but the value of TCP_BACKLOG
the symbol is OS dependent because unfortunately linux and darwin have different interpretations of the value.
Motivation:
The
TrafficResilienceHttpServiceFilterTest.testStopAcceptingConnections(..)
test is flaky and sometimes will find that a connection times out before it was expected to. These additional connections are made to account for intrinsic races as well as kernel connection backlog behavior which is really hard to control.Modifications:
The timeout behavior is the desired result, it just happens before we expect it to. So, instead of trying to make exactly two connections that we expect to succeed followed by one that should fail, we instead can just perform a low iteration loop and demand that we eventually stop accepting connections, which is the desired behavior.
Closes #3076.