move setting to unhealthy after retries #579

iansuvak · 2024-12-05T14:35:20Z

Why this should be merged

Currently the healthcheck fatals immediately on failure. For intermittent connection failures, this creates a race between the reconnect setting the status back to healthy and the next healthcheck call. If the healthcheck is called before the reconnect succeeds, the process shuts down.

An alternative, and potentially cleaner, approach would be to not have a healthcheck failure shut down the process at all, and instead let Kubernetes (or whatever manages the process) shut it down based on the failed check and additional conditions/thresholds.

How this works

The status is no longer set to unhealthy on the first failure. It is set to unhealthy only after the reconnect attempts have failed.

How this was tested

How is this documented

cam-schultz · 2024-12-05T15:11:06Z

relayer/listener.go

@@ -209,6 +208,7 @@ func (lstnr *Listener) processLogs(ctx context.Context) error {
 			// variables such as Quorum values and processing missed blocks.


I think we can get rid of this TODO. If implemented, we'd likely still want to mark the relayer as unhealthy after some period of time or number of failed retries. To me, there's no functional difference between that and simply retrying a fixed number of times before marking as unhealthy, as we do now.

I removed the TODO. Thinking about this more, I think we should not fatal on the unhealthy status, and should instead let the outside caller of the binary decide what to do and how long to give it to become healthy again.

move setting to unhealthy after retries

e6c59ba

iansuvak requested a review from a team as a code owner December 5, 2024 14:35

iansuvak requested review from richardpringle, geoff-vball, bernard-avalabs, michaelkaplan13 and cam-schultz December 5, 2024 14:35

cam-schultz reviewed Dec 5, 2024

remove TODO

9e134e7

cam-schultz approved these changes Dec 5, 2024

geoff-vball approved these changes Dec 5, 2024

Merge branch 'main' into testnet-relayer-fix

2727604

iansuvak merged commit d95b36a into main Dec 6, 2024
8 checks passed

iansuvak deleted the testnet-relayer-fix branch December 6, 2024 13:45
