cluster_has_replica: fix the way a healthy replica is detected #54

Merged (1 commit, Nov 11, 2023)

Conversation

@blogh (Collaborator) commented Sep 27, 2023

For patroni >= version 3.0.4:

  • the role is replica or sync_standby
  • the state is streaming
  • the lag is lower than or equal to max_lag

For prior versions:

  • the role is replica or sync_standby
  • the state is running and the timeline is the same as the leader's
  • the lag is lower than or equal to max_lag
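The criteria above can be sketched as a small predicate. This is a hedged illustration, not check_patroni's actual code: the member fields (`role`, `state`, `timeline`, `lag`) follow Patroni's `/cluster` API, but the function name, the `(major, minor, patch)` version tuple, and the `leader_tl` parameter are our own assumptions.

```python
# Hedged sketch of the healthy-replica criteria described above.
# Assumptions: `member` is one entry from Patroni's /cluster "members"
# list, `patroni_version` is a (major, minor, patch) tuple, `leader_tl`
# is the leader's timeline; the helper name is illustrative.
def is_healthy_replica(member, patroni_version, leader_tl, max_lag=None):
    if member.get("role") not in ("replica", "sync_standby"):
        return False
    if patroni_version >= (3, 0, 4):
        # Recent Patroni reports a dedicated "streaming" state.
        if member.get("state") != "streaming":
            return False
    else:
        # Older versions only report "running", so the member's timeline
        # is compared against the leader's instead.
        if member.get("state") != "running" or member.get("timeline") != leader_tl:
            return False
    # Optional lag check: the lag must be lower than or equal to max_lag.
    if max_lag is not None and member.get("lag", 0) > max_lag:
        return False
    return True
```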

@blogh (Collaborator, Author) commented Sep 27, 2023

cf #50

@blogh (Collaborator, Author) commented Sep 27, 2023

I still need to fix the tests and try it on older supported Python versions.

@blogh blogh self-assigned this Sep 27, 2023
@blogh blogh added the bug Something isn't working label Sep 27, 2023
@blogh blogh force-pushed the rework_replica_states branch 2 times, most recently from c108093 to 341ac64 Compare September 27, 2023 14:51
@mbanck commented Sep 30, 2023

If I read the changes correctly, this also adds the timeline to the perfdata? That might warrant a release notes item as well then.

@blogh (Collaborator, Author) commented Oct 2, 2023

You are right, I changed it. I'll probably continue next week; I am booked for a client this week.

Resolved review threads (outdated): check_patroni/cluster.py, check_patroni/types.py
@blogh (Collaborator, Author) commented Oct 17, 2023

Hi @mbanck,

Do you want to review it?

@blogh blogh marked this pull request as ready for review October 17, 2023 11:19
@blogh (Collaborator, Author) commented Oct 17, 2023

I think this is still wrong.

From PostgreSQL's perspective, a healthy standby could be streaming or in archive recovery
(we don't use slots and use log shipping to catch up). And if we look at is_healthiest_node or
is_failover_possible, Patroni doesn't care about the state of the node either (maybe I missed it?).

It checks things like:

  • the timeline matches the leader's timeline (we do it only for patroni < 3.0.4)
  • the lag is lower than or equal to maximum_lag_on_failover (we do it if --max-lag is used)
  • the nofailover tag is present (we don't check for that)
  • the watchdog is available (we don't check for that, and I think we can't do it from the API)
  • the cluster is not paused (we don't check for that here but there is a dedicated service for that)

So I think we should do something like:

if version < 3.0.4:
   if state == "running" and TL == leader TL:
       test for lag if needed
       the node is healthy

if version >= 3.0.4:
   if state in ["streaming", "in archive recovery"] and TL == leader TL:
       test for lag if needed
       the node is healthy

I don't know what to do about nodes with a nofailover tag. Maybe exclude them if we
use a new --exclude-nofailover-tag option?
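The version-dependent branch above could look roughly like this in Python. This is a hedged sketch under stated assumptions: the member fields come from Patroni's `/cluster` API, the accepted-state sets are taken straight from the pseudocode, and the function and parameter names are illustrative, not check_patroni's real API.

```python
# Sketch of the proposed version-dependent test. Assumptions: `member`
# is one entry from Patroni's /cluster "members" list and `version` is
# a (major, minor, patch) tuple; the function name is illustrative.
def replica_is_healthy(member, leader_tl, version, max_lag=None):
    # Patroni >= 3.0.4 distinguishes "streaming" from "in archive
    # recovery"; older versions only ever report "running".
    ok_states = (
        ("streaming", "in archive recovery")
        if version >= (3, 0, 4)
        else ("running",)
    )
    if member.get("state") not in ok_states:
        return False
    # The timeline must match the leader's in both cases.
    if member.get("timeline") != leader_tl:
        return False
    # Lag is only tested when --max-lag is used.
    if max_lag is not None and member.get("lag", 0) > max_lag:
        return False
    return True
```

A node in archive recovery that is catching up would thus count as healthy on recent Patroni versions, with `--max-lag` left to decide how far behind is acceptable.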

@mbanck commented Nov 1, 2023

I guess in archive recovery means the standby is currently catching up; whether that is healthy or not could then be checked via lag. So I think the above is fine.

I am also not sure what to do about nofailover tags, but in my opinion, this is orthogonal to whether a node is healthy or not.

For patroni >= version 3.0.4:
* the role is `replica` or `sync_standby`
* the state is `streaming` or `in archive recovery`
* the timeline is the same as the leader's
* the lag is lower than or equal to `max_lag`

For prior versions of patroni:
* the role is `replica` or `sync_standby`
* the state is `running`
* the timeline is the same as the leader's
* the lag is lower than or equal to `max_lag`

Additionally, we now display the timeline in the perfstats. We also try
to display the perf stats of unhealthy replicas as much as possible.

Update tests for cluster_has_replica:
* Fix the tests to make them work with the new algorithm
* Add a specific test for timeline divergences
@blogh (Collaborator, Author) commented Nov 9, 2023

@dlax, could you have another look please?

@blogh blogh requested a review from dlax November 10, 2023 10:15
@dlax (Member) left a comment


LGTM

@blogh blogh merged commit 8d6b850 into dalibo:master Nov 11, 2023
3 checks passed