
fix: Some flakyness in windows-agent and wsl-pro-service #873

Merged
merged 5 commits into main from fix-agent-flakyness on Aug 29, 2024

Conversation


@CarlosNihelton CarlosNihelton commented Aug 19, 2024

Addressing windows-agent failures

The tests that fail often are contained in windows-agent/internal/distros/worker/worker_test.go. They all show the same pattern (sketched right after this list):

  1. worker.SubmitTasks submits an immediate task.
  2. the already running goroutine processing tasks is expected to be notified and dequeue that task.
  3. require.Eventually asserts the side effects of that task.
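
A minimal sketch of that shape, with purely illustrative names (this is not the actual worker_test.go code; fakeWorker just mimics the submit-synchronously, process-asynchronously structure):

```go
package worker

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// fakeWorker is purely illustrative: it mimics the shape of the real worker,
// not its implementation.
type fakeWorker struct{ executed chan struct{} }

// SubmitTasks returns synchronously; the "processing" happens in a goroutine,
// so its side effects only become visible when the runtime schedules it.
func (w *fakeWorker) SubmitTasks() {
	go func() { close(w.executed) }()
}

func (w *fakeWorker) taskWasExecuted() bool {
	select {
	case <-w.executed:
		return true
	default:
		return false
	}
}

// TestPatternSketch mirrors the flaky tests: a synchronous submission followed
// by require.Eventually asserting asynchronous side effects.
func TestPatternSketch(t *testing.T) {
	w := &fakeWorker{executed: make(chan struct{})}
	w.SubmitTasks()

	require.Eventually(t, w.taskWasExecuted, 5*time.Second, 100*time.Millisecond,
		"the submitted task should eventually have been executed")
}
```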

I could only find a partial explanation for the failure mode observed:

time="2024-07-26T15:14:24Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task processing"
time="2024-07-26T15:14:24Z" level=info msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": Submitting tasks [\"Test task\"] to queue"
time="2024-07-26T15:14:25Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": stopping task processing"

The second log line confirms a synchronous call is made to worker.Submit, and the failed assertion proves it returned. Since it's synchronous the task was delivered.

Processing is asynchronous. The goroutine is running (maybe paused, but that's up to the runtime). If the task was dequeued we'd see this line in the log:

time="2024-07-26T15:14:23Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task \"Test task\""

If we augment the task queue's Push and Pull methods with more logs to observe how submission and dequeuing behave, we find an interesting pattern: the writer arrives at the channel earlier than the reader when that failure mode happens.

```
diff --git a/windows-agent/internal/distros/worker/task_queue.go b/windows-agent/internal/distros/worker/task_queue.go
index 74f2b9bf..cf9f08b9 100644
--- a/windows-agent/internal/distros/worker/task_queue.go
+++ b/windows-agent/internal/distros/worker/task_queue.go
@@ -4,6 +4,7 @@ import (
        "context"
        "sync"

+       log "github.com/canonical/ubuntu-pro-for-wsl/common/grpc/logstreamer"
        "github.com/canonical/ubuntu-pro-for-wsl/windows-agent/internal/distros/task"
 )

@@ -87,8 +88,10 @@ func (q *taskQueue) Push(t task.Task) {
        q.data = append(q.data, t)

        // Notify waiters if there are any
+       log.Warningf(context.TODO(), "+++++++++ NOTIFYING AWAITERS OF TASKS %v on channel", t)
        select {
        case q.wait <- struct{}{}:
+               log.Warningf(context.TODO(), "+++++++++ NOTIFY COMPLETE %v", t)
        default:
        }
 }
@@ -155,14 +158,16 @@ func (q *taskQueue) Pull(ctx context.Context) task.Task {
                if task, ok := q.tryPopFront(); ok {
                        return task
                }

                q.mu.RLock()
                // This is mostly to appease the race detector
                wait := q.wait
+               log.Warning(ctx, "+++++++++ Pull: Waiting on the channel")
                q.mu.RUnlock()

                select {
                case <-ctx.Done():
+                       log.Warning(ctx, "+++++++++ Pull: inside the loop CTX DONE")
                        return nil
                case <-wait:
                        // ↑
@@ -170,6 +175,7 @@ func (q *taskQueue) Pull(ctx context.Context) task.Task {
                        // | only entry in the queue. Or an empty Load could
                        // | leave an empty "data" behind.
                        // ↓
+                       log.Warning(ctx, "+++++++++ TRY POPPING after the lock")
                        if task, ok := q.tryPopFront(); ok {
                                return task
                        }
```

I couldn't understand why the reader would arrive later, since the goroutine that is supposed to process incoming tasks starts earlier in the test cases, but somehow it sometimes only blocks on the channel after the writer has attempted to write into it and failed. I'd be surprised if the attempt to acquire a read lock inside the Pull method delayed it that much. The channel is unbuffered, and the select-default statement prevents the writer from blocking if there are no readers.

```
=== RUN   TestTaskProcessing/Success_executing_a_task
=== PAUSE TestTaskProcessing/Success_executing_a_task
=== CONT  TestTaskProcessing/Success_executing_a_task
time="2024-08-28T15:26:15-03:00" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_2931794278530498729\": starting task processing"
time="2024-08-28T15:26:15-03:00" level=info msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_2931794278530498729\": Submitting tasks [\"Test task\"] to queue"
time="2024-08-28T15:26:15-03:00" level=warning msg="+++++++++ NOTIFYING AWAITERS OF TASKS Test task on channel"
time="2024-08-28T15:26:15-03:00" level=warning msg="+++++++++ Pull: Waiting on the channel"
    worker_test.go:206:
                Error Trace:    D:/UP4W/cloudinit/windows-agent/internal/distros/worker/worker_test.go:206
                Error:          Condition never satisfied
                Test:           TestTaskProcessing/Success_executing_a_task
                Messages:       distro should have been "Running" after SubmitTask(). Current state is "Stopped"
time="2024-08-28T15:26:20-03:00" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_2931794278530498729\": stopping task processing"
time="2024-08-28T15:26:20-03:00" level=warning msg="+++++++++ Pull: inside the loop CTX DONE"
--- FAIL: TestTaskProcessing (0.00s)
    --- FAIL: TestTaskProcessing/Success_executing_a_task (5.02s)
```
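
To see the lost wake-up in isolation, here is a standalone sketch (not the project's code) of a writer hitting an unbuffered channel through select-default before any reader arrives:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	wait := make(chan struct{}) // unbuffered, like q.wait before this change

	// Writer arrives first: with select-default the notification is simply
	// dropped because nobody is receiving yet.
	select {
	case wait <- struct{}{}:
		fmt.Println("notified a waiting reader")
	default:
		fmt.Println("no reader yet: notification silently dropped")
	}

	// The reader arrives afterwards and blocks; the wake-up it needed is gone.
	select {
	case <-wait:
		fmt.Println("reader woke up")
	case <-time.After(time.Second):
		fmt.Println("reader timed out: lost wake-up")
	}
}
```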

Removing the select-default is not an option: test cases would remain blocked until timeout. In production that's less likely to happen, but it's still undesirable.

I opted to make the q.wait channel buffered, so some number of writers will succeed without blocking, giving the reader more chances to reach the channel; subsequent writers should then have more chances of success as well.
I don't have a good explanation for the exact number of slots the channel should have.
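
Here is a rough sketch of the idea (standalone, not the actual task_queue.go code; the capacity of 16 is just an illustration, not the value used in this PR):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pushNotify mimics the notification step of Push: still non-blocking thanks
// to select-default, but the buffer retains the wake-up for a later reader.
func pushNotify(wait chan struct{}) {
	select {
	case wait <- struct{}{}:
	default: // buffer full: pending notifications already guarantee a wake-up
	}
}

// pullWait mimics the waiting half of Pull.
func pullWait(ctx context.Context, wait chan struct{}) bool {
	select {
	case <-ctx.Done():
		return false
	case <-wait:
		return true
	}
}

func main() {
	wait := make(chan struct{}, 16) // buffered: the exact capacity is a best guess

	pushNotify(wait) // the writer arrives first; the notification is stored, not dropped

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println("reader woke up:", pullWait(ctx, wait)) // true: the buffered wake-up survived
}
```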

I ran the worker tests 1367 times on my machine without a single failure, so this seems to be a suitable fix.

`go test -v .\windows-agent\internal\distros\worker\ -tags=gowslmock -count=1367 -failfast -timeout=120m` :)

Before that I was experiencing failures quite often when running with --count=200 against either the TestTaskProcessing or TestTaskDeduplication test cases.


wsl-pro-service

There is still one test failing sometimes: wsl-pro-service/internal/streams/server_test.go. The test assumes that when server.Stop() is called the goroutine serving the multistream service will exit with an error. It seems that the current implementation of the internal handlingLoop[Command].run() method connects the context objects used in the RPCs in a way that allows its loop to interpret a server.Stop() as a graceful cancellation.

This method derives two context objects mirroring the structure found in the Server type: gCtx for graceful stop and ctx for "brute forcing". Those local context objects added complexity, and it seems we can get rid of them.
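
Here is a minimal sketch of the intended behaviour once the server's contexts are used directly (illustrative names, not the actual handlingLoop code): a forceful stop must surface as an error so that server.Serve() returns non-nil, while a graceful stop lets the loop end quietly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runLoop distinguishes the two stop modes: ctx is the forceful one,
// gracefulCtx the graceful one.
func runLoop(ctx, gracefulCtx context.Context, handleNext func(context.Context) error) error {
	for {
		select {
		case <-ctx.Done():
			// Forceful stop: report it so the serving loop returns a non-nil error.
			return fmt.Errorf("stream handling aborted: %w", ctx.Err())
		case <-gracefulCtx.Done():
			// Graceful stop: finish without error.
			return nil
		default:
			if err := handleNext(ctx); err != nil {
				return err
			}
		}
	}
}

func main() {
	parent := context.Background()
	ctx, stop := context.WithCancel(parent)                 // forceful stop
	gracefulCtx, gracefulStop := context.WithCancel(parent) // graceful stop
	defer gracefulStop()

	// Simulate server.Stop() firing while an RPC is "in flight".
	go func() { time.Sleep(50 * time.Millisecond); stop() }()

	err := runLoop(ctx, gracefulCtx, func(_ context.Context) error {
		time.Sleep(10 * time.Millisecond) // pretend to serve one RPC
		return nil
	})
	fmt.Println("Serve() would return:", err) // non-nil: the forced stop is reported
}
```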

Additionally, there was an error in the server constructor NewServer causing the gracefulCtx to be a child of the ctx due to a shadowed declaration.
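
The pitfall roughly looks like this (a hypothetical sketch with illustrative names, not the actual NewServer code):

```go
package main

import (
	"context"
	"fmt"
)

// buggyContexts shows how re-using the name ctx makes gracefulCtx a child of
// the forceful context instead of the caller's context.
func buggyContexts(ctx context.Context) (forceCtx, gracefulCtx context.Context, stop, gracefulStop context.CancelFunc) {
	// From here on, everything named ctx is the cancellable (forceful) context...
	ctx, stop = context.WithCancel(ctx)
	// ...so gracefulCtx ends up derived from it: cancelling the forceful
	// context also cancels the graceful one. Deriving it from the original
	// parent instead keeps the two stop modes independent.
	gracefulCtx, gracefulStop = context.WithCancel(ctx)
	return ctx, gracefulCtx, stop, gracefulStop
}

func main() {
	forceCtx, gracefulCtx, stop, gracefulStop := buggyContexts(context.Background())
	defer gracefulStop()

	stop() // a forceful server.Stop()...
	<-forceCtx.Done()
	<-gracefulCtx.Done() // ...does not block here: the "graceful" context was cancelled too
	fmt.Println("forced stop also cancelled the graceful context:", gracefulCtx.Err())
}
```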

Per my current understanding, this change in wsl-pro-service seems not to affect its runtime behaviour apart from how it responds to internal force-exit requests.


UDENG-3311


codecov bot commented Aug 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.95%. Comparing base (7b95165) to head (4ed6408).
Report is 6 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #873      +/-   ##
==========================================
+ Coverage   88.87%   88.95%   +0.08%     
==========================================
  Files         103      105       +2     
  Lines        6812     6836      +24     
==========================================
+ Hits         6054     6081      +27     
- Misses        579      583       +4     
+ Partials      179      172       -7
```


@CarlosNihelton CarlosNihelton changed the title fix(agent-tests): Give more slack in the require.Eventually calls. fix: Some flakyness in windows-agent and wsl-pro-service Aug 19, 2024
@CarlosNihelton CarlosNihelton marked this pull request as ready for review August 19, 2024 18:01
@CarlosNihelton CarlosNihelton requested a review from a team as a code owner August 19, 2024 18:01
CarlosNihelton added a commit that referenced this pull request Aug 20, 2024
This hides the Landscape config option (instead of disabling it). No
changes in behaviour nor in how we decide to hide it.
Some strings were also changed per the guidance provided in #861.
Finally, I added a bit more spacing at the top now that the
subtitle is only one line.

Here's how it looks when compiled in release mode.


![image](https://github.com/user-attachments/assets/f9bb98aa-b544-44a2-82d2-6d5da2b186e0)

Closes #861 

---

I'm purposefully basing this PR on top of #873 so I can stress CI to
exercise the changes therein introduced. There is no dependency between
the two tasks.

---
UDENG-3818
CarlosNihelton added a commit that referenced this pull request Aug 21, 2024
Sometimes copy-pasting a pro token brings trailing spaces along. Neither
leading, trailing, nor middle spaces are allowed in a token. This
PR leverages Flutter's built-in mechanisms to filter input supplied to the
token text field so that any kind of whitespace (not only the ASCII
kind) is discarded before any further processing.

Closes #857 

---

I'm purposefully basing this PR on top of #873 so I can stress CI to
exercise the changes therein introduced. There is no dependency between
the two tasks.

---
UDENG-3810
CarlosNihelton and others added 5 commits August 28, 2024 11:53
TestNotifyConfigUpdate races with the distro worker:
1. It submits a config update to the distro
2. It attempts to read the pending tasks file
3. The distro worker may dequeue a task concurrently with that attempt.

See one example of such a failure:
https://github.com/canonical/ubuntu-pro-for-wsl/actions/runs/10505222599/job/29102345493?pr=876

Grep for 'Task_contains_empty_client_conf_when_both_are_empty'.
You'll see that the task was started: `starting task \"LandscapeConfigure\"`.
Whether it succeeds or fails, that ends up removing the task from the queue.
If that happens before we attempt to read the tasks file, the test fails.

By calling distro.Cleanup() we effectively stop its worker, ensuring no
dequeueing and no racing.
I couldn't see a good explanation for races in the worker and task
management components that would match the pattern I observed in the
failed tests:
1. worker.SubmitTasks submits an immediate task.
2. the already running goroutine processing tasks is expected to find
   and dequeue that task.
3. require.Eventually asserts the side effects of that task.

(assuming the Go runtime enforces a consistent memory ordering when
mutexes are employed -- that's documented)

```
time="2024-07-26T15:14:24Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task processing"
time="2024-07-26T15:14:24Z" level=info msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": Submitting tasks [\"Test task\"] to queue"
time="2024-07-26T15:14:25Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": stopping task processing"
```

The second log line confirms a synchronous call is made to
worker.Submit, and the failed assertion proves it returned.
Since it's synchronous the task was delivered.

Processing is asynchronous. The goroutine is running (maybe paused, but
that's up to the runtime). If the task was dequeued we'd see this line
in the log:

```
time="2024-07-26T15:14:23Z" level=debug msg="Distro \"testDistro_UP4W_TestTaskProcessing_Success_executing_a_task_7763261449570405205\": starting task \"Test task\""
```

So it seems that the goroutine didn't have a chance to come to the
foreground to dequeue a task (pure speculation :) ).
Since I couldn't find a reliable way to reproduce this error, I'm in
favor of giving the require.Eventually calls more slack, to see if that
increases the likelihood of that goroutine being scheduled and picking
up the posted task.

If this is correct, it should address all the tests that commonly fail
in "Run Go Tests with mocks" CI jobs.
The first line of this method shadowed the context parameter, which
resulted in gracefulCtx being a child of ctx and contributed to the
spurious failures observed.

Co-authored-by: Didier Roche-Tolomelli <[email protected]>
We should use the stream's and server's contexts directly.
The subcontexts made the code obscure.
This aims to make it clearer that a forceful stop cancels the streams'
contexts, while a graceful stop allows the RPCs to finish their current
activities on their own before stopping the loops.

This method derived two context objects mirroring the structure found in
the Server type: `gCtx` for graceful stop and `ctx` for "brute forcing".

They were arranged in such a way that the local `gCtx` was a child of
`ctx`. If the server's own `ctx` is cancelled to force a stop, depending
on the moment that happens it could be interpreted by this method as a
graceful stop (because it only checks `gCtx`, and that can be cancelled
due to its parent being cancelled).

The whole point here is that `server.Stop()` is assumed in tests to
cause `server.Serve()` to return a non-nil error. But that can only
happen if the handling loop objects fail or perceive a brute-force
cancellation. If `handlingLoop[Command].run()` perceives `server.Stop()`
as a graceful cancellation, then it won't return an error.

Co-authored-by: Didier Roche-Tolomelli <[email protected]>
Tests have shown that it's possible for writers to reach the channel before any reader, so the notification may never arrive.
The number of writers that can succeed without blocking is a best guess at this moment and may be adjusted.
@CarlosNihelton CarlosNihelton marked this pull request as draft August 28, 2024 19:25
@CarlosNihelton CarlosNihelton marked this pull request as ready for review August 28, 2024 19:25
@CarlosNihelton CarlosNihelton self-assigned this Aug 28, 2024

@didrocks didrocks left a comment


I’m unsure about the number of writers, but I will trust your manual tests on it and we’ll see with more usage. Thanks for fixing this flakiness!

@CarlosNihelton CarlosNihelton merged commit 1b8dea7 into main Aug 29, 2024
38 checks passed
@CarlosNihelton CarlosNihelton deleted the fix-agent-flakyness branch August 29, 2024 12:22