-
Notifications
You must be signed in to change notification settings - Fork 289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce automated test crashes #2968
base: main
Are you sure you want to change the base?
Conversation
The key advantage is that exceptions propagate properly. If a thread throws an exception (as a result of a failed test assertion, or otherwise) then the test host crashes and must be restarted.
…tion behaviour testing. This should also allow the test to run on both netcore and netfx.
While not strictly related to this, two comments on the The flaky part of the test methodology is as follows:
Besides using threads directly, the current code enables Thread A to run immediately, while Thread B sleeps for 0.5s. This introduces a race condition where sometimes the command could have finished executing before the cancellation runs. In such a case, the original Assert.Throws call would fail (because ExecuteNonQuery/ExecuteReader completes without throwing an InvalidOperationException.) There was also an edge case where ExecuteNonQuery/ExecuteReader might throw a SqlException rather than an InvalidOperationException. In both of these cases, the exceptions went unhandled, the thread crashed and the testhost restarted. I've tried to rework this coordination slightly with an interlock and a tweak to the approach:
This reduces the race condition, but doesn't eliminate it: calling ExecuteNonQuery and ExecuteReader implicitly resets the SqlCommand's cancellation status, so it's possible that thread B might call Cancel, then thread A might reset this. We can never guarantee that a single Cancel call (or X sequential Cancel calls) will be enough to actually cancel the command; this is why Thread B now calls Cancel in a loop. A second comment on the same TestSqlCommandCancel test: following my changes, there's another couple of errors which look worrying: Managed SNI, on Linux
Native SNI, on Windows
I think the managed SNI exception message is probably "The request failed to run because the batch is aborted, this can be caused by abort signal sent from client, or another request is running in the same session, which makes the session busy." If so, it's another instance of #26. The native SNI exception message looks like a bug though. While this test can fail in two ways, the failure now shows up as a test failure rather than a crash & restart of testhost. I can revert my third commit (the changes to TestSqlCommandCancel) if it's better discussed in an issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@edwardneal thanks so much for helping to make our automated testing more robust!! This is one of my biggest complaints in my day to day work, so anything that reduces the number of times I have to restart a build b/c of flaky tests is super appreciated.
I do have some concerns regarding the interlocking change, and a few other comments I'd like to see addressed before checking it in. But, I'll approve it asap once they are addressed
src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/TestFixtures/DatabaseHelper.cs
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/MirroringTest/ConnectionOnMirroringTest.cs
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/ParameterTest/ParametersTest.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/ParameterTest/ParametersTest.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.Data.SqlClient/tests/ManualTests/AlwaysEncrypted/ApiShould.cs
Outdated
Show resolved
Hide resolved
* Removed two unnecessary iterations from DatabaseHelper. * Added explanatory comments to ApiShould. * Switched to using Task.WaitAll rather than waiting for each Task in sequence.
@edwardneal Thanks again for working on this. I really really appreciate any help with improving our build process. Once we get another review, I'll be happy to merge this 👍🚀 |
It seems there are test failures:
|
It also appears this test failed on Windows too in first round and passed on rerunning, so I'm a little skeptic of this change, though they look promising..
|
Thanks @cheenamalhotra, that's fair. The main objective of this PR is to prevent the exceptions thrown by the secondary thread from killing the testhost process. At the moment, a flaky test kills the process and CI restarts it; this adds 10-15 minutes to the build time. With this change, the flaky test fails, the process returns a non-zero exit code and CI restarts it. I hadn't spotted the second restart criteria earlier, so this PR doesn't reduce the build time as I first thought. It does make some previously-failing tests visible though, and hopefully some future work can start moving progressively more tests into a test run which doesn't have that restart criteria, allowing us to shift the remaining ones towards restarting the individual test rather than the entire testhost. The exception message which starts with "The request failed to run because the bat" is a simple enough fix, I'm pretty sure it's just another instance of efcore#29861 and will change the test to account for that. I'm more concerned about the exception with the corrupted message string though. That looks a little like the native SNI's Edit: looks like there are a few additional possible exception messages which can report cancellation. That's a little frustrating - I'd hoped running it until it failed in VS would catch them all. I'll keep looking... |
Cancellation can trigger one of several different errors, resulting in a flakier test. Also ensure that the query always takes more than 150ms, ensuring that a quick query execution doesn't cause the test to fail. Finally, make sure that we try to read everything from the SqlDataReader.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
This is an attempt to stabilise the automated tests slightly. They sometimes crash and need to be restarted; I've recently been able to reproduce the same behaviour in my local testing.
Several tests instantiate a number of threads, and these threads don't handle exceptions (such as failed test assertions.) If these assertions fail, the testhost exits.
I've converted these threads to tasks, specifying TaskCreationOptions.LongRunning. This will force the task scheduler to give the task its own thread, and it'll mean that the exceptions are propagated back to the caller. The test will thus fail, but the testhost will stay running.
This should hopefully mean that ManualTests consistently takes about 15-20 minutes to run, rather than occasionally taking 30-40 minutes.