Refactor elector somewhat + test suite + poll-only mode #263

brandur · 2024-03-11T00:19:51Z

Here, do a little refactoring in the elector. The overall code doesn't
change that much, but we try to tighten things up with little things
like improved logging and real exponential backoff. It becomes a
start/stop service so that it's a little more normalized with other code
and more robust on start/stop.
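
For readers unfamiliar with the term, here is a minimal sketch of what exponential backoff on consecutive errors can look like. It is not the elector's actual code: `backoffFor`, `base`, and `maxDelay` are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// backoffFor returns how long to wait after numErrors consecutive failures.
// The delay starts at base, doubles with each further failure, and is capped
// at maxDelay. Hypothetical helper, not the elector's real implementation.
func backoffFor(numErrors int, base, maxDelay time.Duration) time.Duration {
	if numErrors < 1 {
		return 0
	}
	delay := base
	for i := 1; i < numErrors; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay
		}
	}
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}

func main() {
	for numErrors := 1; numErrors <= 6; numErrors++ {
		fmt.Println(numErrors, backoffFor(numErrors, time.Second, 10*time.Second))
	}
}
```

Resetting the error counter to zero after a successful attempt (as the `numErrors = 0` line in the diff does) brings the delay back to its starting value.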

The major improvement is the addition of a test suite. Although a
nominal suite was added during the driver refactor to test the
attemptElectOrReelect function, the elector was otherwise only tested
indirectly through the client. The change comes with a variety of tests
that exercise the elector's various behaviors, including one that pits
multiple competing electors against each other to make sure that
contention works.

Lastly, the elector gains a "poll only" mode, and we check that it can
still function using polling alone in case a listener isn't available.
This doesn't have any effect on River feature-wise yet, but the idea is
that after we've added a similar capability to the producer, we'll be
able to support systems where LISTEN/NOTIFY isn't available, like
PgBouncer in transaction mode, or possibly even MySQL or SQLite in the
future. We send the poll only mode through the same barrage of tests
that the elector has to pass when using a database pool.
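
A rough sketch of how a single wait step can serve both modes, assuming the listener is modeled as a channel that is simply nil when no listener is available. The function name `waitForNextAttempt` and its parameters are hypothetical and not taken from the elector.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForNextAttempt blocks until the done channel closes, a resignation
// notification arrives, or the poll interval elapses, and reports which one
// happened. In poll-only mode notifyCh is nil; receiving from a nil channel
// blocks forever, so only the timer or done can wake the caller.
// Hypothetical helper for illustration.
func waitForNextAttempt(done <-chan struct{}, notifyCh <-chan struct{}, interval time.Duration) string {
	timer := time.NewTimer(interval)
	defer timer.Stop()

	select {
	case <-done:
		return "cancelled"
	case <-notifyCh:
		return "notified"
	case <-timer.C:
		return "polled"
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Poll-only mode: no listener, so the wait ends when the interval elapses.
	fmt.Println(waitForNextAttempt(ctx.Done(), nil, 10*time.Millisecond))

	// With a listener: a pending resignation preempts the long interval.
	resigned := make(chan struct{}, 1)
	resigned <- struct{}{}
	fmt.Println(waitForNextAttempt(ctx.Done(), resigned, time.Minute))
}
```

The first call returns `polled` and the second returns `notified`. Polling alone gives the same behavior, only with up to one interval of extra latency after a resignation.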

brandur · 2024-03-11T00:30:04Z

internal/notifier/notifier.go

@@ -442,6 +442,8 @@ func (n *Notifier) Listen(ctx context.Context, topic NotificationTopic, notifyFu
 	}
 	n.subscriptions[topic] = append(existingSubs, sub)

+	n.Logger.InfoContext(ctx, n.Name+": Added subscription", "new_num_subscriptions", len(n.subscriptions[topic]), "topic", topic)


Found having a little more logging on subscriptions pretty handy for test debugging purposes.

This one also feels like it should maybe be a Debug level given it's indicative of healthy operation and pretty low level, thoughts?

K, made those two debug too.

Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more in line with other code, like making logging uniform and having the constructor return only a `*producer` instead of a `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of a full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).

bgentry

Couple of minor comments, but this looks really fantastic. Great work 🎉

bgentry · 2024-03-12T01:21:22Z

internal/leadership/elector.go

+				// We only care about resignations on because we use them to preempt the
+				// election attempt backoff. And we only care about our own key name.


typo here:

We only care about resignations on because

bgentry · 2024-03-12T01:26:52Z

internal/leadership/elector.go

 		}

+		numErrors = 0
+
+		e.Logger.Info(e.Name+": Leadership bid was unsuccessful (not an error)", "client_id", e.clientID)


Info level might be a bit noisy for this running every 5 seconds on every worker and it being a completely benign/expected scenario, maybe Debug?

Suggested change

-		e.Logger.Info(e.Name+": Leadership bid was unsuccessful (not an error)", "client_id", e.clientID)
+		e.Logger.Debug(e.Name+": Leadership bid was unsuccessful (not an error)", "client_id", e.clientID)

I'll say that because we probably don't expect that many different River clients operating in one installation, so it's not like these will be getting endlessly spewed.

That said, now that the tests are all fixed up, I can live with debug, so changed.

Also, I changed everything to InfoContext/DebugContext which I'd meant to do but forgot.
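
To illustrate the effect of the level change, here is a small sketch using the standard library's `log/slog`. It assumes a text handler writing to a buffer; the helper `captureBidLog` and the `client-1` value are made up for the example and do not come from River.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
)

// captureBidLog emits the "unsuccessful bid" message at Debug level through a
// logger whose minimum level is Info (or Debug when debugEnabled is true) and
// returns whatever the handler wrote. Hypothetical helper for illustration.
func captureBidLog(debugEnabled bool) string {
	level := slog.LevelInfo
	if debugEnabled {
		level = slog.LevelDebug
	}

	var buf bytes.Buffer
	logger := slog.New(slog.NewTextHandler(&buf, &slog.HandlerOptions{Level: level}))

	// DebugContext passes the context through to the handler, which is what
	// the switch from Info/Debug to InfoContext/DebugContext enables.
	logger.DebugContext(context.Background(),
		"Elector: Leadership bid was unsuccessful (not an error)",
		"client_id", "client-1")

	return buf.String()
}

func main() {
	fmt.Printf("minimum level info:  %q\n", captureBidLog(false))
	fmt.Printf("minimum level debug: %q\n", captureBidLog(true))
}
```

With the default Info minimum nothing is written, so the periodic message only shows up for users who opt into debug logging.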

bgentry · 2024-03-12T01:27:39Z

internal/leadership/elector.go

-			// of resignations. May want to make this reusable & cancel it when retrying?
-		case <-leadershipNotificationChan:
-			// Somebody just resigned, try to win the next election immediately.
+		case <-e.CancellableSleepRandomBetweenC(ctx, e.electInterval, e.electInterval+e.electIntervalJitter):


there's a potential leaking of timers here, but only if resignations are happening frequently

Yeah, good point. I just added the comment back that was previously there.

I think it might be kind of neat if we could do a timeutil utility exactly like Ticker but which would tick within a random range. I'll put it on my TODO list.
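
A possible shape for such a utility, sketched under the assumption that it should mirror `time.Ticker` (a `C` channel plus `Stop`) while reusing one timer across ticks. `RandomTicker`, `NewRandomTicker`, and `randomDurationBetween` are hypothetical names; nothing like this exists in the PR.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// randomDurationBetween returns a duration in [min, max), or min when the
// range is empty. Hypothetical helper.
func randomDurationBetween(min, max time.Duration) time.Duration {
	if max <= min {
		return min
	}
	return min + time.Duration(rand.Int63n(int64(max-min)))
}

// RandomTicker delivers ticks on C at intervals drawn at random from
// [min, max). It reuses a single timer rather than allocating one per wait.
type RandomTicker struct {
	C        <-chan time.Time
	stop     chan struct{}
	stopOnce sync.Once
}

func NewRandomTicker(min, max time.Duration) *RandomTicker {
	c := make(chan time.Time, 1)
	t := &RandomTicker{C: c, stop: make(chan struct{})}

	go func() {
		timer := time.NewTimer(randomDurationBetween(min, max))
		defer timer.Stop()

		for {
			select {
			case <-t.stop:
				return
			case now := <-timer.C:
				// Non-blocking send: drop the tick if the receiver is
				// behind, similar to how time.Ticker behaves.
				select {
				case c <- now:
				default:
				}
				// The timer channel was just drained, so Reset is safe.
				timer.Reset(randomDurationBetween(min, max))
			}
		}
	}()

	return t
}

// Stop ends the ticker's goroutine. It is safe to call more than once.
func (t *RandomTicker) Stop() {
	t.stopOnce.Do(func() { close(t.stop) })
}

func main() {
	ticker := NewRandomTicker(5*time.Millisecond, 10*time.Millisecond)
	defer ticker.Stop()

	for i := 0; i < 3; i++ {
		fmt.Println("tick at", (<-ticker.C).Format(time.StampMilli))
	}
}
```

Because the timer lives for the lifetime of the ticker, a loop that is frequently woken by resignations would not accumulate pending timers.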

bgentry · 2024-03-12T01:29:16Z

internal/leadership/elector.go

+				if !errors.Is(err, errLostLeadership) {
+					e.Logger.Error(e.Name+": Error keeping leadership", "client_id", e.clientID, "err", err)


Is this inverted? The message seems to align with when the node lost its leadership

A little hard to read, but the idea is to log any error that's not a reelection bid loss.

Changed the code so that:

- Inverted the condition and put in a continue, with the log statement falling to below the conditional.
- Renamed the error to `errLostLeadershipReelection` for better clarity.
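
The control flow described above can be sketched roughly as follows. This is an illustration only: the loop over a slice of results, the `reportKeepLeadershipErrors` name, and `errExampleFailure` are assumptions, and only the `errLostLeadershipReelection` name comes from the comment.

```go
package main

import (
	"errors"
	"fmt"
)

// Both error values are stand-ins defined for this sketch.
var (
	errLostLeadershipReelection = errors.New("lost leadership with no error")
	errExampleFailure           = errors.New("connection reset")
)

// reportKeepLeadershipErrors walks the results of successive reelection
// attempts and calls logError only for genuine failures. Losing a reelection
// bid is an expected outcome, so it is skipped with a continue and the log
// statement sits below the conditional.
func reportKeepLeadershipErrors(results []error, logError func(error)) {
	for _, err := range results {
		if err == nil {
			continue // reelected, nothing to report
		}

		if errors.Is(err, errLostLeadershipReelection) {
			continue // lost the bid, not an error
		}

		logError(err)
	}
}

func main() {
	results := []error{
		nil,
		fmt.Errorf("keep leadership: %w", errLostLeadershipReelection),
		errExampleFailure,
	}

	reportKeepLeadershipErrors(results, func(err error) {
		fmt.Println("Error keeping leadership:", err)
	})
}
```

Using `errors.Is` rather than `==` means a wrapped sentinel is still recognized, so only the last result is logged here.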

bgentry · 2024-03-12T01:32:21Z

internal/notifier/notifier.go

@@ -442,6 +442,8 @@ func (n *Notifier) Listen(ctx context.Context, topic NotificationTopic, notifyFu
 	}
 	n.subscriptions[topic] = append(existingSubs, sub)

+	n.Logger.InfoContext(ctx, n.Name+": Added subscription", "new_num_subscriptions", len(n.subscriptions[topic]), "topic", topic)


This one also feels like it should maybe be a Debug level given it's indicative of healthy operation and pretty low level, thoughts?

brandur · 2024-03-12T03:01:12Z

Thanks!

brandur force-pushed the brandur-elector-refactor-and-tests branch from 8f5f195 to d36c147 Compare March 11, 2024 00:25

brandur force-pushed the brandur-elector-refactor-and-tests branch from d36c147 to cf45da5 Compare March 11, 2024 00:30

brandur requested a review from bgentry March 11, 2024 00:32

brandur mentioned this pull request Mar 11, 2024

Make producer start/stop service + poll-only mode + expanded tests #264

Merged

brandur force-pushed the brandur-elector-refactor-and-tests branch 2 times, most recently from 214b844 to ff8dd55 Compare March 12, 2024 00:56

bgentry approved these changes Mar 12, 2024

brandur force-pushed the brandur-elector-refactor-and-tests branch from ff8dd55 to be04dc5 Compare March 12, 2024 02:53

brandur force-pushed the brandur-elector-refactor-and-tests branch from be04dc5 to 2e674d6 Compare March 12, 2024 02:59

brandur merged commit 3e7d87d into master Mar 12, 2024
10 checks passed

brandur deleted the brandur-elector-refactor-and-tests branch March 12, 2024 03:01
