
The charm allows a 2-node cluster but it's not functional after a failover #570

Open
nobuto-m opened this issue Aug 5, 2024 · 4 comments
Labels: bug

Comments


nobuto-m commented Aug 5, 2024

Steps to reproduce

  1. Prepare a MAAS provider
  2. Deploy the charm with 2 units by following https://charmhub.io/postgresql/docs/h-scale:
    juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 2
  3. Take down the primary unit (see the sketch below)
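
For step 3, a minimal sketch of one way to take the primary down, assuming the primary is postgresql/0 (as in the status output below); powering the machine off is just one way to simulate losing the node:

$ juju run postgresql/leader get-primary    # confirm which unit is currently the primary
$ juju ssh postgresql/0 -- sudo poweroff    # take that unit's machine down abruptly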

Expected behavior

It's either:

  • remain functional after taking down one of the two units
  • or prevent a two-node cluster from being deployed, by setting juju status to blocked and suggesting 3 units instead (a three-node example follows)
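
For comparison, a three-node deployment keeps Raft quorum after losing one node; assuming the same base and channel, it only differs in the unit count:

$ juju deploy postgresql --base ubuntu@22.04 --channel 14/stable -n 3
$ juju add-unit postgresql    # or add a third unit to an existing two-unit application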

Actual behavior

Similar topic to #566.

Juju status looks okay at a glance. However, the surviving unit doesn't report which unit is the primary at the moment.

$ juju status
Model     Controller            Cloud/Region       Version  SLA          Timestamp
postgres  mysunbeam-controller  mysunbeam/default  3.5.3    unsupported  12:17:40Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
postgresql  14.11    active    1/2  postgresql  14/stable  429  no       

Unit           Workload  Agent  Machine  Public address   Ports     Message
postgresql/0   unknown   lost   0        192.168.151.115  5432/tcp  agent lost, see 'juju show-status-log postgresql/0'
postgresql/1*  active    idle   1        192.168.151.116  5432/tcp  

Machine  State    Address          Inst id    Base          AZ       Message
0        down     192.168.151.115  machine-7  ubuntu@22.04  default  Deployed
1        started  192.168.151.116  machine-8  ubuntu@22.04  default  Deployed

Also, the get-primary action reports the dead unit as the primary, which shouldn't be the case.

$ juju run postgresql/leader get-primary
Running operation 3 with 1 task
  - task 4 on unit-postgresql-1

Waiting for task 4...
primary: postgresql/0

Patroni's member list cannot be fetched since the Raft quorum was lost.

$ juju ssh postgresql/1 -- sudo -u snap_daemon env PATRONI_LOG_LEVEL=DEBUG patronictl -c /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml list
2024-08-05 12:20:16,176 - DEBUG - Loading configuration from file /var/snap/charmed-postgresql/current/etc/patroni/patroni.yaml
2024-08-05 12:20:21,243 - INFO - waiting on raft
2024-08-05 12:20:26,243 - INFO - waiting on raft
2024-08-05 12:20:31,244 - INFO - waiting on raft
2024-08-05 12:20:36,244 - INFO - waiting on raft
2024-08-05 12:20:41,245 - INFO - waiting on raft
2024-08-05 12:20:46,245 - INFO - waiting on raft
2024-08-05 12:20:51,246 - INFO - waiting on raft
2024-08-05 12:20:56,247 - INFO - waiting on raft
^C
Aborted!
Connection to 192.168.151.116 closed.
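
This is expected once quorum is gone: Raft needs a majority, i.e. floor(n/2) + 1 nodes, so a two-node cluster has a quorum of 2 and cannot survive losing either node, while a three-node cluster also has a quorum of 2 and tolerates one failure. As a rough check of what the surviving node still reports about itself, one could query Patroni's REST API directly instead of patronictl (assuming the default port 8008 and plain HTTP, which may differ in the charm's setup):

$ juju ssh postgresql/1 -- curl -s http://localhost:8008/patroni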

On a side note, Raft support is deprecated in upstream Patroni as of 3.0.0.
https://patroni.readthedocs.io/en/latest/releases.html#version-3-0-0

Versions

Operating system: jammy

Juju CLI: 3.5.3

Juju agent: 3.5.3

Charm revision: 14/stable 429

LXD: N/A

Log output

Juju debug log:
model_debug.log

Additional context

nobuto-m added the bug label on Aug 5, 2024

delgod commented Aug 6, 2024

On a side note, Raft support is deprecated in upstream Patroni as of 3.0.0.

Yes, Raft is not supported upstream, but it is supported and maintained by our team for all our users (until some point in time).

nobuto-m (Author) commented

Looks like upstream assumes two PostgreSQL nodes plus one witness node, so my understanding is that running the cluster with only two nodes is not supported.

https://patroni.readthedocs.io/en/latest/yaml_configuration.html#raft-deprecated

Q: It is possible to run Patroni and PostgreSQL only on two nodes?

A: Yes, on the third node you can run patroni_raft_controller (without Patroni and PostgreSQL). In such a setup, one can temporarily lose one node without affecting the primary.
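
For illustration only, a minimal sketch of what such a witness node could look like with upstream Patroni rather than the charm; the file name, data directory, addresses, and port below are placeholders, and the positional config-file invocation of patroni_raft_controller is an assumption:

$ cat > raft-witness.yml <<'EOF'
# Hypothetical witness-only config: no postgresql/bootstrap sections, just the
# raft keys documented upstream (data_dir, self_addr, partner_addrs).
raft:
  data_dir: /var/lib/raft
  self_addr: 192.168.151.117:2222     # the witness node itself (placeholder address/port)
  partner_addrs:
    - 192.168.151.115:2222            # postgresql/0
    - 192.168.151.116:2222            # postgresql/1
EOF
$ patroni_raft_controller raft-witness.yml    # runs only the Raft member, no Patroni/PostgreSQL here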

taurus-forever (Contributor) commented

@7annaba3l we have one more reason to include this in the 25.04 roadmap.
