fix: Some flakiness in windows-agent and wsl-pro-service #873
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #873      +/-   ##
==========================================
+ Coverage   88.87%   88.95%   +0.08%
==========================================
  Files         103      105       +2
  Lines        6812     6836      +24
==========================================
+ Hits         6054     6081      +27
- Misses        579      583       +4
+ Partials      179      172       -7
```

☔ View full report in Codecov by Sentry.
This hides the Landscape config option (instead of disabling it). No changes in behaviour nor in how we decide to hide it. Some strings were also changed per the guidance provided in #861. Finally, I added a little more spacing at the top now that the subtitle is only one line. Here's how it looks when compiled in release mode.

![image](https://github.com/user-attachments/assets/f9bb98aa-b544-44a2-82d2-6d5da2b186e0)

Closes #861

---

I'm purposefully basing this PR on top of #873 so I can stress CI to exercise the changes introduced there. There is no dependency between the two tasks.

---

UDENG-3818
Sometimes a copy-pasted Pro token comes with trailing spaces, yet neither leading, trailing, nor middle spaces are allowed in a token. This PR leverages Flutter's built-in mechanisms to filter the input supplied to the token text field so that any kind of whitespace (not only the ASCII ones) is discarded before any further processing.

Closes #857

---

I'm purposefully basing this PR on top of #873 so I can stress CI to exercise the changes introduced there. There is no dependency between the two tasks.

---

UDENG-3810
TestNotifyConfigUpdate races with the distro worker:

1. It submits a config update to the distro.
2. It attempts to read the pending tasks file.
3. The distro worker may dequeue a task concurrently with that attempt.

See one example of such a failure: https://github.com/canonical/ubuntu-pro-for-wsl/actions/runs/10505222599/job/29102345493?pr=876 (grep for 'Task_contains_empty_client_conf_when_both_are_empty'). You'll see that the task was started: `starting task "LandscapeConfigure"`. Whether it succeeds or fails, that ends up removing the task from the queue. If that happens before we attempt to read the tasks file, the test fails.

By calling distro.Cleanup() we effectively stop its worker, ensuring no dequeueing and no racing.
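For illustration, a minimal sketch of that pattern; `newTestDistro`, `submitConfigUpdate` and `pendingTasksFile` are hypothetical stand-ins, not the test's actual helpers:

```go
d := newTestDistro(t)    // distro with its worker goroutine already running
submitConfigUpdate(t, d) // enqueues a LandscapeConfigure task; the worker could pick it up at any moment

// Stop the worker before inspecting on-disk state: once it is stopped,
// nothing can dequeue the task concurrently with the read below.
d.Cleanup()

got, err := os.ReadFile(pendingTasksFile(d))
require.NoError(t, err)
require.Contains(t, string(got), "LandscapeConfigure")
```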
I couldn't see a good explanation for races in the worker and task management components that would match the pattern I observe in the failed tests:

1. worker.SubmitTasks submits an immediate task.
2. The already running goroutine processing tasks is expected to find and dequeue that task.
3. require.Eventually asserts the side effects of that task.

(assuming the Go runtime enforces memory sequence consistency when mutexes are employed -- that's documented)

```
time="2024-07-26T15:14:24Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task processing"
time="2024-07-26T15:14:24Z" level=info msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": Submitting tasks [\"Test task\"] to queue"
time="2024-07-26T15:14:25Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": stopping task processing"
```

The second log line confirms a synchronous call is made to worker.Submit, and the failed assertion proves it returned. Since it's synchronous, the task was delivered.

Processing is asynchronous. The goroutine is running (maybe paused, but that's up to the runtime). If the task had been dequeued we'd see this line in the log:

```
time="2024-07-26T15:14:23Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task \"Test task\""
```

So it seems that the goroutine didn't get a chance to come to the foreground to dequeue a task (pure speculation :) ). Since I couldn't find a reliable way to reproduce this error, I'm in favor of adding more time between require.Eventually calls, to see if that increases the likelihood of that goroutine being scheduled and picking up the posted task. If this is correct, it should address all the tests that commonly fail in the "Run Go Tests with mocks" CI jobs.
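For illustration, a hedged sketch of "adding more time" with testify's `require.Eventually`; the durations and the `taskSideEffectObserved` predicate are examples, not the values used in the repository:

```go
// Widening waitFor and tick gives the scheduler more opportunities to run the
// worker goroutine so it dequeues the submitted task before the assertion gives up.
require.Eventually(t, func() bool {
	return taskSideEffectObserved() // hypothetical predicate checking the task's side effect
}, 20*time.Second, 200*time.Millisecond, "submitted task was never dequeued and executed")
```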
The first line of this method shadowed the context parameter, which results in gracefulCtx being a child of ctx and contributes to the spurious failures observed. Co-authored-by: Didier Roche-Tolomelli <[email protected]>
We should use the stream's and server's contexts directly. The subcontexts made the code obscure. This aims to make it clearer that a forceful stop cancels the streams' contexts, while a graceful stop allows the RPCs to finish their current activities on their own before stopping the loops.

This method derived two context objects mirroring the structure found in the Server type: `gCtx` for graceful stop and `ctx` for "brute forcing". They were arranged in such a way that the local `gCtx` is a child of `ctx`. If the server's own `ctx` is cancelled to force a stop, depending on the moment that happens it could be interpreted by this method as a graceful stop (because it only checks `gCtx`, and that can be cancelled due to its parent being cancelled).

The whole point here is that `server.Stop()` is assumed in tests to cause `server.Serve()` to return a non-nil error. But that can only happen if the handling loop objects fail or perceive a brute-force cancellation. If `handlingLoop[Command].run()` perceives `server.Stop()` as a graceful cancellation, then they won't return errors.

Co-authored-by: Didier Roche-Tolomelli <[email protected]>
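A minimal sketch of why the old arrangement was ambiguous, not the actual `run()` code; `serverCtx` and the surrounding loop are assumptions:

```go
// The old arrangement: gCtx is a child of ctx, so cancelling ctx (server.Stop)
// also cancels gCtx.
ctx, cancel := context.WithCancel(serverCtx) // "brute force" scope
defer cancel()
gCtx, gCancel := context.WithCancel(ctx) // graceful scope, unintentionally nested under ctx
defer gCancel()

select {
case <-gCtx.Done():
	// Reached on a graceful stop, but also whenever ctx is cancelled (both cases
	// become ready), so a forced stop can be misread and the loop exits without an error.
	return nil
case <-ctx.Done():
	return ctx.Err() // the non-nil error that server_test.go expects after server.Stop()
}
```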
Tests have shown that it's possible for writers to reach the channel before any reader, so the notification may never arrive. The number of writers the buffer accommodates is a best guess at this moment and may be adjusted.
I’m unsure about the number of writers, but I will trust your manual tests on it and we’ll see with more usage. Thanks for fixing this flakiness!
Addressing windows-agent failures
The tests that fail often are contained in `windows-agent/internal/distros/worker/worker_test.go`, and they all show the same pattern. I could find half an explanation for the failure mode observed:

The second log line confirms a synchronous call is made to worker.Submit, and the failed assertion proves it returned. Since it's synchronous, the task was delivered.

Processing is asynchronous. The goroutine is running (maybe paused, but that's up to the runtime). If the task had been dequeued we'd see the corresponding "starting task" line in the log.

If we augment the task queue Push and Pull methods with more logs to observe how submission and dequeuing behave, we find an interesting pattern: when that failure mode happens, the writer arrives at the channel earlier than the reader.

I couldn't understand why the reader would arrive later, since the goroutine supposed to process incoming tasks starts earlier in the test cases, but somehow it sometimes only blocks on the channel after the writer has attempted to write into it and failed. I'd be surprised if the attempt to acquire a read lock inside the Pull method delayed it that much. The channel is unbuffered and the select-default statement prevents the writer from blocking if there are no readers.

Removing the select-default is not an option: test cases would remain blocked until timeout. In production that's less likely to happen, but it is still undesirable.

I opted for making the `q.wait` channel buffered, so some number of writers will succeed without blocking, giving the reader more chances to reach the channel, and subsequent writers should have more chances of success as well. I don't have a good explanation for the exact number of slots the channel should have.
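A minimal sketch of the shape of that change, assuming the queue's `wait` channel is used purely as a wake-up notification; the element type and the buffer size are assumptions rather than the values chosen here:

```go
// The queue's notification channel is now buffered: a handful of writers can post
// a wake-up even before any reader has reached the channel.
q.wait = make(chan struct{}, 4) // buffer size is a best guess, as discussed above

// Writer side (Push): never block, even if nobody is listening and the buffer is full.
select {
case q.wait <- struct{}{}:
default: // notifications already pending; dropping this one is fine
}

// Reader side (Pull): block until at least one notification has been posted.
<-q.wait
```

The default branch keeps Push non-blocking, matching the existing behaviour described above; the buffer only widens the window in which a notification survives until the reader arrives.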
I ran the worker tests 1367 times on my machine without a single failure, so this seems to be a suitable fix: `go test -v .\windows-agent\internal\distros\worker\ -tags=gowslmock -count=1367 -failfast -timeout=120m`. :)

Before that I was experiencing failures quite often when running with `--count=200` with any of the `TestTaskProcessing` or `TestTaskDeduplication` test cases.

wsl-pro-service
There is still one test failing sometimes: `wsl-pro-service/internal/streams/server_test.go`. The test assumes that when `server.Stop()` is called, the goroutine serving the multistream service will exit with an error. It seems that the current implementation of the internal `handlingLoop[Command].run()` method connects the context objects used in the RPCs in a way that allows its loop to interpret a `server.Stop()` as a graceful cancellation.

This method derives two context objects mirroring the structure found in the Server type: `gCtx` for graceful stop and `ctx` for "brute forcing". Those local context objects added complexity and it seems we can get rid of them.

Additionally, there was an error in the server constructor `NewServer` causing the `gracefulCtx` to be a child of the `ctx` due to a shadow declaration.
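To make the intended behaviour concrete, a hedged sketch of a handling loop watching the server's contexts directly; the field names, the `commands` channel and `handle` are assumptions, not the repository's actual types:

```go
for {
	select {
	case <-s.ctx.Done():
		// Forceful stop (server.Stop): abort and let Serve() surface a non-nil error.
		return s.ctx.Err()
	case <-s.gracefulCtx.Done():
		// Graceful stop: in-flight work is allowed to wind down; exit without an error.
		return nil
	case cmd := <-commands:
		handle(cmd) // hypothetical stand-in for serving one command from the stream
	}
}
```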
Per my current understanding, this change in wsl-pro-service does not affect its runtime behaviour apart from responding to force-exit internal requests.
UDENG-3311