Merge pull request juju#17979 from manadart/dqlite-ha-doc

juju#17979 This venerable document, out of date even for Mongo, now reflects HA in the Dqlite world.

## QA steps

None required.

## Documentation changes

This is one.

## Links

**Jira card:** [JUJU-4997](https://warthogs.atlassian.net/browse/JUJU-4997)

[JUJU-4997]: https://warthogs.atlassian.net/browse/JUJU-4997?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Showing 1 changed file with 83 additions and 110 deletions.

@@ -1,110 +1,83 @@

High Availability (HA)
======================

High Availability in general terms means that we have 3 or more (up to 7)
State Machines, each one of which can be used as the master.

This is an overview of how it works:

### Mongo

_Mongo_ is always started in [replicaset mode](http://docs.mongodb.org/manual/replication/).

If not in HA, this will behave as if it were a single mongodb and, in practical
terms, there is no difference from a regular setup.

### Voting

A voting member of the replicaset is one that has a say in which member is master.

A non-voting member is just a storage backup.

Currently we don't support non-voting members; instead, when a member is non-voting it
means that said controller is going to be removed entirely.

### Ensure availability

There is an `ensure-availability` command for juju. It takes `-n` (minimum number
of state machines) as an optional parameter; if it's not provided, it
defaults to 3.
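
For example, to request five state machines (an illustrative invocation; the
machines brought up depend on the provider):

```sh
juju ensure-availability -n 5
```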

This needs to be an odd number in order to prevent ties during voting.

The number cannot be larger than seven (making the current possibilities 3,
5 and 7) due to a limitation of mongodb, which cannot have more than 7
replica set voting members.

Currently the number can be increased but not decreased (decreasing is planned).
In the first case Juju will bring up as many machines as necessary to meet the
requirement; in the second, nothing will happen, since the rule tries to have
_"at least that many"_.

At present there is no way to reduce the number of machines. You can kill
enough machines by hand to reduce to the number you need, but this is risky and
**not recommended**. If you kill fewer than half of the machines (half + 1
remaining), running `enable-ha` again will add more machines to
replace the dead ones. If you kill more, there is no way to recover, as there
are not enough voting machines.

The EnableHA API call will report the changes that it
made to the model, which will shortly be reflected in reality.

### The API

There is an API server running on all State Machines. These talk to all
the peers, but queries and updates are addressed to the mongo master instance.

Unit and machine agents connect to any of the API servers by trying to connect
to all the addresses concurrently, but not simultaneously: each address is
tried in turn after a short delay. After a successful connection, the
connected address will be stored; it will be tried first when next connecting.

### The peergrouper worker

It looks at the current state and decides what the peergroup members should
look like, and continually tries to maintain those members.

The reason for its existence is that it can often take a while for mongo to
allow a peer group change, so we can't change it directly in the
EnableHA API call.

Its worker loop continually watches:

1. The current set of controllers
2. The addresses of the current controllers
3. The status of the current mongo peergroup

It feeds all that information into `desiredPeerGroup`, which provides the peer
group that we want to be, and continually tries to set that peer group in mongo
until it succeeds.

**NOTE:** There is one situation which currently doesn't work: if you've
only got one controller, you can't switch to another one.

### The Singleton Workers

**Note:** This section reflects the current behavior of these workers but
should by no means be taken as an example to follow, since most (if not all)
should run concurrently and are going to change in the near future.

The following workers require only a single instance to be running
at any one moment:

* The environment provisioner
* The firewaller
* The charm revision updater
* The state cleaner
* The transaction resumer
* The minunits worker

When a machine agent connects to the state, it decides whether
it is on the same instance as the mongo master instance, and
if so, it runs the singleton workers; otherwise it doesn't run them.

Because we are using `mgo.Strong` consistency semantics,
it's guaranteed that our mongo connection will be dropped
when the master changes, which means that when the
master changes, the machine agent will reconnect to the
state and choose whether to run the singleton workers again.

It also means that we can never accidentally have two
singleton workers performing operations at the same time.

# Controller high availability (HA)

See first: [Juju user docs | How to make a controller highly available]

This document details controller and agent behaviour when running controllers
in HA mode.
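
HA is enabled from the Juju client. An illustrative invocation (the controller
count shown is the default):

```sh
juju enable-ha -n 3
```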

## Dqlite

Each controller is a [Dqlite] node. The `dbaccessor` worker on each controller is
responsible for maintaining the Dqlite cluster. When entering HA mode, the
`dbaccessor` worker will configure the local Dqlite node as a member of the
cluster.

When starting Dqlite, the worker must bind it to an IP address. The address is
read from the controller configuration file populated by the controller charm.
If there is no address to use for binding, the worker will wait for one to be
written to the file before attempting to join the cluster.
See _Controller charm_ below.

Each Dqlite node has a role within the cluster. Juju does not manage node
roles; this is handled within Dqlite itself. A cluster is constituted by
(see the sketch after this list):
- one _leader_ to which all database reads and writes are redirected,
- up to two other _voters_ that participate in leader elections,
- _stand-bys_; and
- _spares_.
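
The roles can be observed from any node using the go-dqlite client library.
This is a minimal sketch rather than Juju code; the import path, node address
and port are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/canonical/go-dqlite/client"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Connect to a reachable cluster node. The address is illustrative.
	cli, err := client.New(ctx, "10.0.0.5:17666")
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Cluster lists every member with its current role
	// (voter, stand-by or spare); the leader is one of the voters.
	nodes, err := cli.Cluster(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range nodes {
		fmt.Printf("%d\t%s\t%s\n", n.ID, n.Address, n.Role)
	}
}
```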

If the number of controller instances is reduced to one, the `dbaccessor`
worker detects this scenario and reconfigures the cluster with the local node
as the only member.

## Controller charm

The controller charm propagates bind addresses to the `dbaccessor` worker by
writing them to the controller configuration file. Each controller unit shares
its resolved bind address with the other units via the `db-cluster` peer
relation. The charm must be able to determine a unique address in the
local-cloud scope before it is shared with other units and written to the
configuration file. If no unique address can be determined, the user must supply
an endpoint binding for the relation using a space that ensures a unique IP
address.
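
As an illustration only, such a binding can be expressed with `juju bind`; the
space name below is hypothetical and must already exist in the controller
model:

```sh
# Bind the controller application's db-cluster endpoint to a space
# that yields a unique address on each controller machine.
juju bind -m controller controller db-cluster=internal-space
```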

## API addresses for agents

When machines in the control plane change, the `api-address-updater` worker
for each agent rewrites the agent's configuration file with usable API
addresses from all controllers. Agents will try these addresses in random order
until they establish a successful controller connection.

The list of addresses supplied to agent configuration can be influenced by the
`juju-mgmt-space` controller configuration value. This is supplied with a space
name so that agent-controller communication can be isolated to specific
networks.
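
For example, the value can be supplied at bootstrap time; the cloud and space
name here are illustrative:

```sh
juju bootstrap aws --config juju-mgmt-space=mgmt-space
```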

## API addresses for clients

Each time the Juju client establishes a connection to the Juju controller, the
controller sends the current list of API addresses, and the client updates these
in its local store. The client's first connection attempt is always to the last
address that it used successfully. Others are tried subsequently if required.

Addresses used by clients are not influenced by the `juju-mgmt-space`
configuration.
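
For illustration, the client's cached endpoints can be inspected with
`juju show-controller`, which reports them under `api-endpoints` (output
abridged; addresses invented):

```sh
juju show-controller
#   details:
#     ...
#     api-endpoints: ['10.0.0.5:17070', '10.0.0.6:17070', '10.0.0.7:17070']
```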

## Single instance workers

Many workers, such as the `dbaccessor` worker, run on all controller instances,
but there are some workers that must run on exactly one controller instance.
An obvious example of this is a model's compute provisioner: we would never
want more than one actor attempting to start a cloud instance for a new
machine.

Single instance workers are those declared in the model manifolds configuration
that use the `isResponsible` decorator. This in turn is based on a flag set by the
`singular` worker.

The `singular` worker only sets the flag if it is the current lease holder for
the `singular-controller` namespace. See the appropriate documentation for more
information on leases.
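
The gating can be pictured with a simplified sketch. This is not the actual
Juju manifold wiring; the names below are illustrative stand-ins for the flag
raised by the `singular` worker and the worker it guards.

```go
package main

import "fmt"

// responsibleFlag stands in for the flag set by the singular worker: it is
// true only while this controller holds the lease for the
// singular-controller namespace.
type responsibleFlag struct {
	holdsLease bool
}

func (f responsibleFlag) Check() bool { return f.holdsLease }

// runIfResponsible mirrors the idea behind the isResponsible decorator:
// the guarded worker only starts when the flag is set.
func runIfResponsible(flag responsibleFlag, name string, start func()) {
	if !flag.Check() {
		fmt.Printf("%s: not the lease holder; worker not started\n", name)
		return
	}
	start()
}

func main() {
	flag := responsibleFlag{holdsLease: true}
	runIfResponsible(flag, "compute-provisioner", func() {
		fmt.Println("compute-provisioner: running on this controller only")
	})
}
```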

[Juju user docs | How to make a controller highly available]: https://juju.is/docs/juju/manage-controllers#heading--make-a-controller-highly-available
[Dqlite]: https://dqlite.io/