move setting to unhealthy after retries #579

Merged
merged 3 commits into from
Dec 6, 2024

iansuvak
Contributor

@iansuvak iansuvak commented Dec 5, 2024

Why this should be merged

Currently the healthcheck fatals immediately on failure. For intermittent connection failures, this creates a race between the reconnect, which sets the status back to healthy, and the next healthcheck call. If the healthcheck is called before the reconnect completes, the process shuts down.

An alternative and potentially cleaner approach would be to not have a healthcheck failure shut down the process at all, and instead let Kubernetes (or whatever manages the process) shut it down based on the failed check and additional conditions/thresholds.
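A minimal sketch of that alternative, not the relayer's actual code: the healthcheck handler only reports the status (200 or 503) and never exits the process, so an external supervisor such as a Kubernetes liveness probe can apply its own failure thresholds. The names `healthHandler`, `newFlag`, and `probe`, and the `/health` path, are hypothetical and used only for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// newFlag returns a health flag with the given initial value (hypothetical helper).
func newFlag(v bool) *atomic.Bool {
	f := &atomic.Bool{}
	f.Store(v)
	return f
}

// healthHandler reports the current status over HTTP instead of exiting.
// The decision to restart is left to whatever polls this endpoint.
func healthHandler(healthy *atomic.Bool) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		if healthy.Load() {
			w.WriteHeader(http.StatusOK)
			fmt.Fprintln(w, "ok")
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprintln(w, "unhealthy")
	}
}

// probe sends one in-memory GET request to the handler and returns the
// HTTP status code, standing in for an external health probe.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	healthy := newFlag(true)
	h := healthHandler(healthy)
	fmt.Println("status while healthy:", probe(h))

	// A failed reconnect flips the flag; the process keeps running.
	healthy.Store(false)
	fmt.Println("status while unhealthy:", probe(h))
}
```

With this shape, how long the process is given to become healthy again is a setting on the supervisor's side (for example a probe failure threshold) rather than something hard-coded in the binary.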

How this works

The status is no longer set to unhealthy on the first failure; it is set only after the reconnect attempts have failed.
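The behavior can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `healthStatus`, `reconnectWithRetries`, and `errConnRefused` are hypothetical names, and the backoff between attempts that a real reconnect loop would likely have is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errConnRefused is a hypothetical error used by the demo below.
var errConnRefused = errors.New("connection refused")

// healthStatus is a hypothetical stand-in for the relayer's health flag.
type healthStatus struct {
	healthy atomic.Bool
}

// newHealthStatus returns a status that starts out healthy.
func newHealthStatus() *healthStatus {
	s := &healthStatus{}
	s.healthy.Store(true)
	return s
}

// reconnectWithRetries calls connect up to maxAttempts times (at least once).
// The status is flipped to unhealthy only after every attempt has failed, so
// a healthcheck that runs while retries are in progress still sees healthy.
func reconnectWithRetries(status *healthStatus, maxAttempts int, connect func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = connect(); lastErr == nil {
			return nil // reconnected; status was never touched
		}
	}
	status.healthy.Store(false)
	return fmt.Errorf("reconnect failed after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	status := newHealthStatus()

	// An intermittent failure: the connection succeeds on the third attempt.
	calls := 0
	flaky := func() error {
		calls++
		if calls < 3 {
			return errConnRefused
		}
		return nil
	}
	err := reconnectWithRetries(status, 5, flaky)
	fmt.Println("flaky:", err, "healthy:", status.healthy.Load())

	// A persistent failure: every attempt fails, so the status flips.
	alwaysDown := func() error { return errConnRefused }
	err = reconnectWithRetries(status, 5, alwaysDown)
	fmt.Println("down:", err, "healthy:", status.healthy.Load())
}
```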

How this was tested

How is this documented

@@ -209,6 +208,7 @@ func (lstnr *Listener) processLogs(ctx context.Context) error {
// variables such as Quorum values and processing missed blocks.
Collaborator

I think we can get rid of this TODO. If implemented, we'd likely still want to mark the relayer as unhealthy after some period of time or number of failed retries. To me, there's no functional difference between that and simply retrying a fixed number of times before marking as unhealthy, as we do now.

Contributor Author

I removed the TODO. Thinking about this more, I think we should not fatal on the unhealthy status, and should instead let the outside caller of the binary decide what to do and how long to give it to become healthy again.

@iansuvak iansuvak merged commit d95b36a into main Dec 6, 2024
8 checks passed
@iansuvak iansuvak deleted the testnet-relayer-fix branch December 6, 2024 13:45
3 participants