Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

traffic-resilience-http: Fix flaky testStopAcceptingConnections() test #3125

Merged
merged 13 commits into from
Dec 5, 2024

Conversation

bryce-anderson
Copy link
Contributor

@bryce-anderson bryce-anderson commented Dec 2, 2024

Motivation:

The TrafficResilienceHttpServiceFilterTest.testStopAcceptingConnections(..) test is flaky and sometimes will find that a connection times out before it was expected to. These additional connections are made to account for intrinsic races as well as kernel connection backlog behavior which is really hard to control.

Modifications:

The timeout behavior is the desired result, it just happens before we expect it to. So, instead of trying to make exactly two connections that we expect to succeed followed by one that should fail, we instead can just perform a low iteration loop and demand that we eventually stop accepting connections, which is the desired behavior.

Closes #3076.

@bryce-anderson bryce-anderson changed the title DO NO MERGE: trying to repro issue 3076 traffic-resilience-http: Fix flaky testStopAcceptingConnections() test Dec 3, 2024
@bryce-anderson bryce-anderson marked this pull request as ready for review December 3, 2024 23:35
@bryce-anderson
Copy link
Contributor Author

The initial commits all ran the offending test in a loop to surface the problem. The last few commits with ✅ runs had some form of the test fix and successfully repeated the test.

Copy link
Member

@idelpivnitskiy idelpivnitskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one!

// We expect up to a couple connections to succeed due to the intrinsic race between disabling accepts
// and new connect requests, as well as to account for kernel connect backlog. However, we do expect it
// to fail after a fairly short number of iterations.
for (int i = 0; i < 5; i++) {
Copy link
Member

@Scottmitch Scottmitch Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 -> should this be proportional to TCP_BACKLOG? I read the commit message description but I'm assuming the test can sequence such that "we stop accepting connections" and "try to make a new connection" so the kernel should only accept TCP_BACKLOG [+ 1] connections before queues fill up bcz server isn't doing accept

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't enforce a happens-before wrt the 'stop accepting' and 'make a new one'. I've lowered the value to three, since that appears to be the upper bound for when we stop accepting connections.

Copy link
Member

@Scottmitch Scottmitch Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does 3 come from? Can you clarify the mechanics and who is "we" in this statement when we stop accepting connections.? I'm assuming there should be some relationship to TCP_BACKLOG (how many connection kernel will syn/ack before the application calls accept)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 comes from the existing behavior and justifications of the tests:
1 - satisfying a pending accept
2 - backlog of 1 for both linux and darwin (although params have different interpretations)
3 - This one should fail.
It happens that 2 also fails from time to time with the behavior that we expect from 3, and that is what results in the flakyness.

The "we" is the listener loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would TCP_BACKLOG + 2 make sense with comments similar to what you have above (potentially considering the off-by-one difference for linux)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should base it on TCP_BACKLOG because that symbol has a different purpose than you'd expect (and thus it's big associated comment above the definition). The TLDR is that the backlog is effectively 1 across the board, but the value of TCP_BACKLOG the symbol is OS dependent because unfortunately linux and darwin have different interpretations of the value.

@bryce-anderson bryce-anderson merged commit 74dc8c1 into apple:main Dec 5, 2024
11 checks passed
@bryce-anderson bryce-anderson deleted the bl_anderson/issue_3076 branch December 5, 2024 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TrafficResilienceHttpServiceFilterTest. testStopAcceptingConnections failure
3 participants