cluster_has_replica: fix the way a healthy replica is detected #54

Merged (1 commit, Nov 11, 2023)

Conversation

@blogh (Collaborator) commented Sep 27, 2023

For patroni >= version 3.0.4:

  • the role is replica or sync_standby
  • the state is streaming
  • the lag is lower than or equal to max_lag

For prior versions:

  • the role is replica or sync_standby
  • the state is running and the timeline is the same as the leader's
  • the lag is lower than or equal to max_lag
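The criteria above can be sketched as a small predicate. This is a hedged illustration, not check_patroni's actual code: the member fields (`role`, `state`, `timeline`, `lag`) follow Patroni's `/cluster` API, but the function name, the `(major, minor, patch)` version tuple, and the `leader_tl` parameter are our own assumptions.

```python
# Hedged sketch of the healthy-replica criteria described above.
# Assumptions: `member` is one entry from Patroni's /cluster "members"
# list, `patroni_version` is a (major, minor, patch) tuple, `leader_tl`
# is the leader's timeline; the helper name is illustrative.
def is_healthy_replica(member, patroni_version, leader_tl, max_lag=None):
    if member.get("role") not in ("replica", "sync_standby"):
        return False
    if patroni_version >= (3, 0, 4):
        # Recent Patroni reports a dedicated "streaming" state.
        if member.get("state") != "streaming":
            return False
    else:
        # Older versions only report "running", so the member's timeline
        # is compared against the leader's instead.
        if member.get("state") != "running" or member.get("timeline") != leader_tl:
            return False
    # Optional lag check: the lag must be lower than or equal to max_lag.
    if max_lag is not None and member.get("lag", 0) > max_lag:
        return False
    return True
```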

@blogh (Collaborator, Author) commented Sep 27, 2023

cf #50

@blogh (Collaborator, Author) commented Sep 27, 2023

I still need to fix the tests and try it on older supported Python versions.

@blogh blogh self-assigned this Sep 27, 2023
@blogh blogh added the bug Something isn't working label Sep 27, 2023
@blogh blogh force-pushed the rework_replica_states branch 2 times, most recently from c108093 to 341ac64 Compare September 27, 2023 14:51
@mbanck commented Sep 30, 2023

If I read the changes correctly, this also adds the timeline to the perfdata? That might warrant a release notes item as well then.

@blogh (Collaborator, Author) commented Oct 2, 2023

You are right, I changed it. I'll probably continue next week; I am booked for a client this week.

Resolved review threads (outdated): check_patroni/cluster.py, check_patroni/types.py
@blogh (Collaborator, Author) commented Oct 17, 2023

Hi @mbanck,

Do you want to review it?

@blogh blogh marked this pull request as ready for review October 17, 2023 11:19
@blogh (Collaborator, Author) commented Oct 17, 2023

I think this is still wrong.

From PostgreSQL's perspective, a healthy standby could be streaming or in archive recovery
(we don't use slots and use log shipping to catch up). And if we look at is_healthiest_node or
is_failover_possible, Patroni doesn't care about the state of the node either (maybe I missed it?).

It checks things like:

  • the timeline matches the leader's timeline (we do it only for patroni < 3.0.4)
  • the lag is lower than or equal to maximum_lag_on_failover (we do it if --max-lag is used)
  • the nofailover tag is present (we don't check for that)
  • the watchdog is available (we don't check for that, and I think we can't do it from the API)
  • the cluster is not paused (we don't check for that here but there is a dedicated service for that)

So I think we should do something like:

if version < 3.0.4:
   if state == "running" and TL == leader TL:
       test for lag if needed
       the node is healthy

if version >= 3.0.4:
   if state in ["streaming", "in archive recovery"] and TL == leader TL:
       test for lag if needed
       the node is healthy

I don't know what to do about nodes with a nofailover tag. Maybe exclude them if we
use a new --exclude-nofailover-tag option?
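The version-dependent branch above could look roughly like this in Python. This is a hedged sketch under stated assumptions: the member fields come from Patroni's `/cluster` API, the accepted-state sets are taken straight from the pseudocode, and the function and parameter names are illustrative, not check_patroni's real API.

```python
# Sketch of the proposed version-dependent test. Assumptions: `member`
# is one entry from Patroni's /cluster "members" list and `version` is
# a (major, minor, patch) tuple; the function name is illustrative.
def replica_is_healthy(member, leader_tl, version, max_lag=None):
    # Patroni >= 3.0.4 distinguishes "streaming" from "in archive
    # recovery"; older versions only ever report "running".
    ok_states = (
        ("streaming", "in archive recovery")
        if version >= (3, 0, 4)
        else ("running",)
    )
    if member.get("state") not in ok_states:
        return False
    # The timeline must match the leader's in both cases.
    if member.get("timeline") != leader_tl:
        return False
    # Lag is only tested when --max-lag is used.
    if max_lag is not None and member.get("lag", 0) > max_lag:
        return False
    return True
```

A node in archive recovery that is catching up would thus count as healthy on recent Patroni versions, with `--max-lag` left to decide how far behind is acceptable.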

@mbanck commented Nov 1, 2023

I guess in archive recovery means the standby is currently catching up; whether that is healthy or not could then be checked via lag. So I think the above is fine.

I am also not sure what to do about nofailover tags, but in my opinion, this is orthogonal to whether a node is healthy or not.

For patroni >= version 3.0.4:
* the role is `replica` or `sync_standby`
* the state is `streaming` or `in archive recovery`
* the timeline is the same as the leader's
* the lag is lower than or equal to `max_lag`

For prior versions of patroni:
* the role is `replica` or `sync_standby`
* the state is `running`
* the timeline is the same as the leader's
* the lag is lower than or equal to `max_lag`

Additionally, we now display the timeline in the perfstats. We also try
to display the perf stats of unhealthy replicas as much as possible.

Update tests for cluster_has_replica:
* Fix the tests to make them work with the new algorithm
* Add a specific test for timeline divergences
@blogh (Collaborator, Author) commented Nov 9, 2023

@dlax, could you have another look please?

@blogh blogh requested a review from dlax November 10, 2023 10:15
@dlax (Member) left a comment


LGTM

@blogh blogh merged commit 8d6b850 into dalibo:master Nov 11, 2023
3 checks passed