-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor elector somewhat + test suite + poll-only mode #263
Conversation
8f5f195
to
d36c147
Compare
internal/notifier/notifier.go
Outdated
@@ -442,6 +442,8 @@ func (n *Notifier) Listen(ctx context.Context, topic NotificationTopic, notifyFu | |||
} | |||
n.subscriptions[topic] = append(existingSubs, sub) | |||
|
|||
n.Logger.InfoContext(ctx, n.Name+": Added subscription", "new_num_subscriptions", len(n.subscriptions[topic]), "topic", topic) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found having a little more logging on subscriptions pretty handy for test debugging purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one also feels like it should maybe be a Debug
level given it's indicative of healthy operation and pretty low level, thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K, made those two debug too.
d36c147
to
cf45da5
Compare
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
214b844
to
ff8dd55
Compare
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of minor comments, but this looks really fantastic. Great work 🎉
internal/leadership/elector.go
Outdated
// We only care about resignations on because we use them to preempt the | ||
// election attempt backoff. And we only care about our own key name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo here:
We only care about resignations on because
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx.
internal/leadership/elector.go
Outdated
} | ||
|
||
numErrors = 0 | ||
|
||
e.Logger.Info(e.Name+": Leadership bid was unsuccessful (not an error)", "client_id", e.clientID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Info
level might be a bit noisy for this running every 5 seconds on every worker and it being a completely benign/expected scenario, maybe Debug
?
e.Logger.Info(e.Name+": Leadership bid was unsuccessful (not an error)", "client_id", e.clientID) | |
e.Logger.Debug(e.Name+": Leadership bid was unsuccessful (not an error)", "client_id", e.clientID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll say that because we probably don't expect that many different River clients operating in one installation, so it's not like these will be getting endlessly spewed.
That said, now that the tests are all fixed up, I can live with debug, so changed.
Also, I changed everything to InfoContext
/DebugContext
which I'd meant to do but forgot.
// of resignations. May want to make this reusable & cancel it when retrying? | ||
case <-leadershipNotificationChan: | ||
// Somebody just resigned, try to win the next election immediately. | ||
case <-e.CancellableSleepRandomBetweenC(ctx, e.electInterval, e.electInterval+e.electIntervalJitter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a potential leaking of timers here, but only if resignations are happening frequently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point. I just added the comment back that was previously there.
I think it might be kind of neat if we could do a timeutil
utility exactly like Ticker
but which would tick within a random range. I'll put it on my TODO list.
internal/leadership/elector.go
Outdated
if !errors.Is(err, errLostLeadership) { | ||
e.Logger.Error(e.Name+": Error keeping leadership", "client_id", e.clientID, "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this inverted? The message seems to align with when the node lost its leadership
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little hard to read, but the idea is to log any error that's not a reelection bid loss.
Changed the code so that:
- Inverted the condition and put in a
continue
, with the log statement falling to below the conditional. - Renamed the error
errLostLeadershipReelection
for better clarity.
internal/notifier/notifier.go
Outdated
@@ -442,6 +442,8 @@ func (n *Notifier) Listen(ctx context.Context, topic NotificationTopic, notifyFu | |||
} | |||
n.subscriptions[topic] = append(existingSubs, sub) | |||
|
|||
n.Logger.InfoContext(ctx, n.Name+": Added subscription", "new_num_subscriptions", len(n.subscriptions[topic]), "topic", topic) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one also feels like it should maybe be a Debug
level given it's indicative of healthy operation and pretty low level, thoughts?
ff8dd55
to
be04dc5
Compare
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Here, do a little refactoring in the elector. The overall code doesn't change that much, but we try to tighten things up with little things like improved logging and real exponential backoff. It becomes a start/stop service so that it's a little more normalized with other code and more robust on start/stop. The major improvement is the addition of a test suite. Although there was a nominal suite before that was added during the driver refactor to test the `attemptElectOrReelect` function, the elector was only tested indirectly otherwise through the client. The change comes with a variety of tests that exercise the elector's various behaviors, including one that pits multiple competing electors against each other to make sure that works. Lastly, the elector gains a "poll only" mode in which we check that it can still function using polling only in case a listener isn't available. This doesn't have any effect on River feature-wise yet, but the idea is that after we've added a similar capability to the producer, we'll be able to support systems where `LISTEN`/`NOTIFY` aren't available like PgBouncer in transaction mode, or possibly even MySQL or SQLite in the future. We send the poll only mode through the same barrage of tests that we require it to pass when using a database pool.
be04dc5
to
2e674d6
Compare
Thanks! |
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
) Follow up changes like #253, #262, and #263 to make the producer a start/stop service, giving it a more predictable way to invoke start and stop, making it safer to run and cleaning up caller code where it's used in the client and test cases. With all these changes taken together we'll have every service in the client using the same unified service interface, which will clean up code and let us write some neat utilities to operate across all of them. Aside from that, we clean up the producer in ways to bring it more inline with other code, like making logging uniform and having the constructor return only a `*producer` instead of `(*producer, error)` that needs to be checked despite an error always being indicative of a bug in this context. We expand the test suite, adding tests like (1) verifying that workers are really stopped when `workCtx` is cancelled, (2) verifying that the max worker slots work as expected and that the producer limits its fetches, and (3) start/stop stress. Like with #263, we give the producer a poll only mode, which also gets the full test barrage using a shared test transaction instead of full database pool. Also like #263, this poll only mode is still prospective and not yet put to full use (although it will be soon).
Here, do a little refactoring in the elector. The overall code doesn't
change that much, but we try to tighten things up with little things
like improved logging and real exponential backoff. It becomes a
start/stop service so that it's a little more normalized with other code
and more robust on start/stop.
The major improvement is the addition of a test suite. Although there
was a nominal suite before that was added during the driver refactor to
test the
attemptElectOrReelect
function, the elector was only testedindirectly otherwise through the client. The change comes with a variety
of tests that exercise the elector's various behaviors, including one
that pits multiple competing electors against each other to make sure
that works.
Lastly, the elector gains a "poll only" mode in which we check that it
can still function using polling only in case a listener isn't
available. This doesn't have any effect on River feature-wise yet, but
the idea is that after we've added a similar capability to the producer,
we'll be able to support systems where
LISTEN
/NOTIFY
aren'tavailable like PgBouncer in transaction mode, or possibly even MySQL or
SQLite in the future. We send the poll only mode through the same
barrage of tests that we require it to pass when using a database pool.