
fix: Some flakyness in windows-agent and wsl-pro-service #873

Merged
merged 5 commits into main from fix-agent-flakyness on Aug 29, 2024

Conversation


@CarlosNihelton CarlosNihelton commented Aug 19, 2024

Addressing windows-agent failures

The tests that fail often are contained in windows-agent/internal/distros/worker/worker_test.go. They all show the same pattern (sketched right after this list):

  1. worker.SubmitTasks submits an immediate task.
  2. the already running goroutine processing tasks is expected to be notified and dequeue that task.
  3. require.Eventually asserts the side effects of that task.
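
A minimal sketch of that shape, with purely illustrative names (this is not the actual worker_test.go code; fakeWorker just mimics the submit-synchronously, process-asynchronously structure):

```go
package worker

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// fakeWorker is purely illustrative: it mimics the shape of the real worker,
// not its implementation.
type fakeWorker struct{ executed chan struct{} }

// SubmitTasks returns synchronously; the "processing" happens in a goroutine,
// so its side effects only become visible when the runtime schedules it.
func (w *fakeWorker) SubmitTasks() {
	go func() { close(w.executed) }()
}

func (w *fakeWorker) taskWasExecuted() bool {
	select {
	case <-w.executed:
		return true
	default:
		return false
	}
}

// TestPatternSketch mirrors the flaky tests: a synchronous submission followed
// by require.Eventually asserting asynchronous side effects.
func TestPatternSketch(t *testing.T) {
	w := &fakeWorker{executed: make(chan struct{})}
	w.SubmitTasks()

	require.Eventually(t, w.taskWasExecuted, 5*time.Second, 100*time.Millisecond,
		"the submitted task should eventually have been executed")
}
```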

I could only find a partial explanation for the failure mode observed:

time="2024-07-26T15:14:24Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task processing"
time="2024-07-26T15:14:24Z" level=info msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": Submitting tasks [\"Test task\"] to queue"
time="2024-07-26T15:14:25Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": stopping task processing"

The second log line confirms a synchronous call is made to worker.Submit, and the failed assertion proves it returned. Since it's synchronous the task was delivered.

Processing is asynchronous. The goroutine is running (maybe paused, but that's up to the runtime). If the task was dequeued we'd see this line in the log:

time="2024-07-26T15:14:23Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task \"Test task\""

If we augment the task queue's Push and Pull methods with more logs to observe how submission and dequeuing behave, we find an interesting pattern: the writer arrives at the channel earlier than the reader when that failure mode happens.

```
diff --git a/windows-agent/internal/distros/worker/task_queue.go b/windows-agent/internal/distros/worker/task_queue.go
index 74f2b9bf..cf9f08b9 100644
--- a/windows-agent/internal/distros/worker/task_queue.go
+++ b/windows-agent/internal/distros/worker/task_queue.go
@@ -4,6 +4,7 @@ import (
        "context"
        "sync"

+       log "github.com/canonical/ubuntu-pro-for-wsl/common/grpc/logstreamer"
        "github.com/canonical/ubuntu-pro-for-wsl/windows-agent/internal/distros/task"
 )

@@ -87,8 +88,10 @@ func (q *taskQueue) Push(t task.Task) {
        q.data = append(q.data, t)

        // Notify waiters if there are any
+       log.Warningf(context.TODO(), "+++++++++ NOTIFYING AWAITERS OF TASKS %v on channel", t)
        select {
        case q.wait <- struct{}{}:
+               log.Warningf(context.TODO(), "+++++++++ NOTIFY COMPLETE %v", t)
        default:
        }
 }
@@ -155,14 +158,16 @@ func (q *taskQueue) Pull(ctx context.Context) task.Task {
                if task, ok := q.tryPopFront(); ok {
                        return task
                }

                q.mu.RLock()
                // This is mostly to appease the race detector
                wait := q.wait
+               log.Warning(ctx, "+++++++++ Pull: Waiting on the channel")
                q.mu.RUnlock()

                select {
                case <-ctx.Done():
+                       log.Warning(ctx, "+++++++++ Pull: inside the loop CTX DONE")
                        return nil
                case <-wait:
                        // ↑
@@ -170,6 +175,7 @@ func (q *taskQueue) Pull(ctx context.Context) task.Task {
                        // | only entry in the queue. Or an empty Load could
                        // | leave an empty "data" behind.
                        // ↓
+                       log.Warning(ctx, "+++++++++ TRY POPPING after the lock")
                        if task, ok := q.tryPopFront(); ok {
                                return task
                        }
```

I couldn't understand why the reader would arrive later, since the goroutine that is supposed to process incoming tasks starts earlier in the test cases, but somehow it sometimes only blocks on the channel after the writer has attempted to write into it and failed. I'd be surprised if the attempt to acquire a read lock inside the Pull method delayed it that much. The channel is unbuffered, and the select-default statement prevents the writer from blocking if there are no readers.

```
=== RUN   TestTaskProcessing/Success_executing_a_task
=== PAUSE TestTaskProcessing/Success_executing_a_task
=== CONT  TestTaskProcessing/Success_executing_a_task
time="2024-08-28T15:26:15-03:00" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_2931794278530498729\": starting task processing"
time="2024-08-28T15:26:15-03:00" level=info msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_2931794278530498729\": Submitting tasks [\"Test task\"] to queue"
time="2024-08-28T15:26:15-03:00" level=warning msg="+++++++++ NOTIFYING AWAITERS OF TASKS Test task on channel"
time="2024-08-28T15:26:15-03:00" level=warning msg="+++++++++ Pull: Waiting on the channel"
    worker_test.go:206:
                Error Trace:    D:/UP4W/cloudinit/windows-agent/internal/distros/worker/worker_test.go:206
                Error:          Condition never satisfied
                Test:           TestTaskProcessing/Success_executing_a_task
                Messages:       distro should have been "Running" after SubmitTask(). Current state is "Stopped"
time="2024-08-28T15:26:20-03:00" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_2931794278530498729\": stopping task processing"
time="2024-08-28T15:26:20-03:00" level=warning msg="+++++++++ Pull: inside the loop CTX DONE"
--- FAIL: TestTaskProcessing (0.00s)
    --- FAIL: TestTaskProcessing/Success_executing_a_task (5.02s)
```
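
To see the lost wake-up in isolation, here is a standalone sketch (not the project's code) of a writer hitting an unbuffered channel through select-default before any reader arrives:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	wait := make(chan struct{}) // unbuffered, like q.wait before this change

	// Writer arrives first: with select-default the notification is simply
	// dropped because nobody is receiving yet.
	select {
	case wait <- struct{}{}:
		fmt.Println("notified a waiting reader")
	default:
		fmt.Println("no reader yet: notification silently dropped")
	}

	// The reader arrives afterwards and blocks; the wake-up it needed is gone.
	select {
	case <-wait:
		fmt.Println("reader woke up")
	case <-time.After(time.Second):
		fmt.Println("reader timed out: lost wake-up")
	}
}
```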

Removing the select-default is not an option: test cases would remain blocked until timeout. In production that's less likely to happen, but it's still undesirable.

I opted to make the q.wait channel buffered, so some number of writers will succeed without blocking, giving the reader more chances to reach the channel; subsequent writers should then have more chances of success as well.
I don't have a good explanation for the exact number of slots the channel should have.
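
Here is a rough sketch of the idea (standalone, not the actual task_queue.go code; the capacity of 16 is just an illustration, not the value used in this PR):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pushNotify mimics the notification step of Push: still non-blocking thanks
// to select-default, but the buffer retains the wake-up for a later reader.
func pushNotify(wait chan struct{}) {
	select {
	case wait <- struct{}{}:
	default: // buffer full: pending notifications already guarantee a wake-up
	}
}

// pullWait mimics the waiting half of Pull.
func pullWait(ctx context.Context, wait chan struct{}) bool {
	select {
	case <-ctx.Done():
		return false
	case <-wait:
		return true
	}
}

func main() {
	wait := make(chan struct{}, 16) // buffered: the exact capacity is a best guess

	pushNotify(wait) // the writer arrives first; the notification is stored, not dropped

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println("reader woke up:", pullWait(ctx, wait)) // true: the buffered wake-up survived
}
```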

I ran the worker tests 1367 times on my machine without a single failure, so this seems to be a suitable fix.

`go test -v .\windows-agent\internal\distros\worker\ -tags=gowslmock -count=1367 -failfast -timeout=120m` :)

Before that I was experiencing failures quite often when running with --count=200 against either the TestTaskProcessing or TestTaskDeduplication test cases.


wsl-pro-service

There is still one test failing sometimes: wsl-pro-service/internal/streams/server_test.go. The test assumes that when server.Stop() is called the goroutine serving the multistream service will exit with an error. It seems that the current implementation of the internal handlingLoop[Command].run() method connects the context objects used in the RPCs in a way that allows its loop to interpret a server.Stop() as a graceful cancellation.

This method derives two context objects mirroring the structure found in the Server type: gCtx for graceful stop and ctx for "brute forcing". Those local context objects added complexity, and it seems we can get rid of them.
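
Here is a minimal sketch of the intended behaviour once the server's contexts are used directly (illustrative names, not the actual handlingLoop code): a forceful stop must surface as an error so that server.Serve() returns non-nil, while a graceful stop lets the loop end quietly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runLoop distinguishes the two stop modes: ctx is the forceful one,
// gracefulCtx the graceful one.
func runLoop(ctx, gracefulCtx context.Context, handleNext func(context.Context) error) error {
	for {
		select {
		case <-ctx.Done():
			// Forceful stop: report it so the serving loop returns a non-nil error.
			return fmt.Errorf("stream handling aborted: %w", ctx.Err())
		case <-gracefulCtx.Done():
			// Graceful stop: finish without error.
			return nil
		default:
			if err := handleNext(ctx); err != nil {
				return err
			}
		}
	}
}

func main() {
	parent := context.Background()
	ctx, stop := context.WithCancel(parent)                 // forceful stop
	gracefulCtx, gracefulStop := context.WithCancel(parent) // graceful stop
	defer gracefulStop()

	// Simulate server.Stop() firing while an RPC is "in flight".
	go func() { time.Sleep(50 * time.Millisecond); stop() }()

	err := runLoop(ctx, gracefulCtx, func(_ context.Context) error {
		time.Sleep(10 * time.Millisecond) // pretend to serve one RPC
		return nil
	})
	fmt.Println("Serve() would return:", err) // non-nil: the forced stop is reported
}
```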

Additionally, there was an error in the server constructor NewServer causing the gracefulCtx to be a child of the ctx due to a shadowed declaration.
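
The pitfall roughly looks like this (a hypothetical sketch with illustrative names, not the actual NewServer code):

```go
package main

import (
	"context"
	"fmt"
)

// buggyContexts shows how re-using the name ctx makes gracefulCtx a child of
// the forceful context instead of the caller's context.
func buggyContexts(ctx context.Context) (forceCtx, gracefulCtx context.Context, stop, gracefulStop context.CancelFunc) {
	// From here on, everything named ctx is the cancellable (forceful) context...
	ctx, stop = context.WithCancel(ctx)
	// ...so gracefulCtx ends up derived from it: cancelling the forceful
	// context also cancels the graceful one. Deriving it from the original
	// parent instead keeps the two stop modes independent.
	gracefulCtx, gracefulStop = context.WithCancel(ctx)
	return ctx, gracefulCtx, stop, gracefulStop
}

func main() {
	forceCtx, gracefulCtx, stop, gracefulStop := buggyContexts(context.Background())
	defer gracefulStop()

	stop() // a forceful server.Stop()...
	<-forceCtx.Done()
	<-gracefulCtx.Done() // ...does not block here: the "graceful" context was cancelled too
	fmt.Println("forced stop also cancelled the graceful context:", gracefulCtx.Err())
}
```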

Per my current understanding, this change in wsl-pro-service seems not to affect its runtime behaviour apart from how it responds to internal force-exit requests.


UDENG-3311


codecov bot commented Aug 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.95%. Comparing base (7b95165) to head (4ed6408).
Report is 6 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #873      +/-   ##
==========================================
+ Coverage   88.87%   88.95%   +0.08%     
==========================================
  Files         103      105       +2     
  Lines        6812     6836      +24     
==========================================
+ Hits         6054     6081      +27     
- Misses        579      583       +4     
+ Partials      179      172       -7
```


@CarlosNihelton CarlosNihelton changed the title fix(agent-tests): Give more slack in the require.Eventually calls. fix: Some flakyness in windows-agent and wsl-pro-service Aug 19, 2024
@CarlosNihelton CarlosNihelton marked this pull request as ready for review August 19, 2024 18:01
@CarlosNihelton CarlosNihelton requested a review from a team as a code owner August 19, 2024 18:01
CarlosNihelton added a commit that referenced this pull request Aug 20, 2024
This hides the Landscape config option (instead of disabling it). No
changes in behaviour nor in how we decide to hide it.
Some strings were also changed per the guidance provided in #861.
Finally, I added a bit more spacing at the top now that the
subtitle is only one line.

Here's how it looks when compiled in release mode.


![image](https://github.com/user-attachments/assets/f9bb98aa-b544-44a2-82d2-6d5da2b186e0)

Closes #861 

---

I'm purposefully basing this PR on top of #873 so I can stress CI to
exercise the changes therein introduced. There is no dependency between
the two tasks.

---
UDENG-3818
CarlosNihelton added a commit that referenced this pull request Aug 21, 2024
Sometimes copy-pasting a pro token brings trailing spaces along. Neither
leading, trailing, nor middle spaces are allowed in a token. This
PR leverages Flutter's built-in mechanisms to filter input supplied to the
token text field so that any kind of whitespace (not only the ASCII
kind) is discarded before any further processing.

Closes #857 

---

I'm purposefully basing this PR on top of #873 so I can stress CI to
exercise the changes therein introduced. There is no dependency between
the two tasks.

---
UDENG-3810
CarlosNihelton and others added 5 commits August 28, 2024 11:53
TestNotifyConfigUpdate races with the distro worker:
1. It submits a config update to the distro
2. It attempts to read the pending tasks file
3. The distro worker may dequeue a task concurrently with that attempt.

See one example of such a failure:
https://github.com/canonical/ubuntu-pro-for-wsl/actions/runs/10505222599/job/29102345493?pr=876

Grep for 'Task_contains_empty_client_conf_when_both_are_empty'.
You'll see that the task was started: `starting task \"LandscapeConfigure\"`.
Whether it succeeds or fails, that ends up removing the task from the queue.
If that happens before we attempt to read the tasks file, the test fails.

By calling distro.Cleanup() we effectively stop its worker, ensuring no
dequeueing and no racing.
I couldn't see a good explanation for races in the worker and task
management components that would match the pattern I observed in the
failed tests:
1. worker.SubmitTasks submits an immediate task.
2. the already running goroutine processing tasks is expected to find
   and dequeue that task.
3. require.Eventually asserts the side effects of that task.

(assuming the Go runtime enforces a consistent memory ordering when
mutexes are employed -- that's documented)

```
time="2024-07-26T15:14:24Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task processing"
time="2024-07-26T15:14:24Z" level=info msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": Submitting tasks [\"Test task\"] to queue"
time="2024-07-26T15:14:25Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": stopping task processing"
```

The second log line confirms a synchronous call is made to
worker.Submit, and the failed assertion proves it returned.
Since it's synchronous the task was delivered.

Processing is asynchronous. The goroutine is running (maybe paused, but
that's up to the runtime). If the task was dequeued we'd see this line
in the log:

```
time="2024-07-26T15:14:23Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task \"Test task\""
```

So it seems that the goroutine didn't have a chance to come to the
foreground to dequeue a task (pure speculation :) ).
Since I couldn't find a reliable way to reproduce this error, I'm in
favor of giving the require.Eventually calls more slack, to see if that
increases the likelihood of that goroutine being scheduled and picking
up the posted task.

If this is correct, it should address all the tests that commonly fail
in "Run Go Tests with mocks" CI jobs.
The first line of this method shadowed the context parameter, which
resulted in gracefulCtx being a child of ctx and contributed to the
spurious failures observed.

Co-authored-by: Didier Roche-Tolomelli <[email protected]>
We should use the stream's and server's contexts directly.
The subcontexts made the code obscure.
This aims to make it clearer that a forceful stop cancels the streams'
contexts, while a graceful stop allows the RPCs to finish their current
activities on their own before stopping the loops.

This method derived two context objects mirroring the structure found in
the Server type: `gCtx` for graceful stop and `ctx` for "brute forcing".

They were arranged in such a way that the local `gCtx` was a child of
`ctx`. If the server's own `ctx` is cancelled to force a stop, depending
on the moment that happens it could be interpreted by this method as a
graceful stop (because it only checks `gCtx`, and that can be cancelled
due to its parent being cancelled).

The whole point here is that `server.Stop()` is assumed in tests to
cause `server.Serve()` to return a non-nil error. But that can only
happen if the handling loop objects fail or perceive a brute-force
cancellation. If `handlingLoop[Command].run()` perceives `server.Stop()`
as a graceful cancellation, then it won't return an error.

Co-authored-by: Didier Roche-Tolomelli <[email protected]>
Tests have shown that it's possible for writers to reach the channel before any reader, so the notification may never arrive.
The number of writers that can succeed without blocking is a best guess at this moment and may be adjusted.
@CarlosNihelton CarlosNihelton marked this pull request as draft August 28, 2024 19:25
@CarlosNihelton CarlosNihelton marked this pull request as ready for review August 28, 2024 19:25
@CarlosNihelton CarlosNihelton self-assigned this Aug 28, 2024

@didrocks didrocks left a comment


I’m unsure about the number of writers, but I will trust your manual tests on it and we’ll see with more usage. Thanks for fixing this flakiness!

@CarlosNihelton CarlosNihelton merged commit 1b8dea7 into main Aug 29, 2024
38 checks passed
@CarlosNihelton CarlosNihelton deleted the fix-agent-flakyness branch August 29, 2024 12:22