
Handle replication slot conflicts for rolling deploys of new Electric servers and restarts of crashed servers #1749

Closed
KyleAMathews opened this issue Sep 24, 2024 · 8 comments · Fixed by #1762

@KyleAMathews
Contributor

Right now we have a fixed name for our replication slot in Postgres, which means you can't do a rolling deploy (i.e. start a new Electric server instance while the old one is running and then kill the old server once the new one is healthy).

To fix this, we need to use random slot names that won't conflict.

A related issue: we should also add heartbeats from Electric to Postgres. A common problem with hosting is that an instance crashes and a new instance is created to take its place. When an Electric instance crashes, it might not be able to clean up its replication slot, so the new instance should be able to see that the old Electric has stopped heartbeating and clean the slot up on its behalf.

@KyleAMathews added this to the Production Readiness milestone Sep 24, 2024
@alco
Member

alco commented Sep 24, 2024

To avoid the problem of leftover replication slots, we could use the strategy where a single persistent replication slot is created on first run but then another temporary slot is created for the replication connection. Once the connection closes, Postgres will remove the temporary slot automatically.
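
A minimal sketch of that idea at the Postgres level, assuming a `pgoutput`-based logical slot and illustrative slot names (not Electric's actual naming):

```sql
-- Created once on first run; this slot survives restarts and disconnects.
SELECT pg_create_logical_replication_slot('electric_persistent', 'pgoutput');

-- On each replication connection, a TEMPORARY slot is used instead and is
-- dropped automatically by Postgres when the connection closes. Over the
-- streaming replication protocol that would look like:
--   CREATE_REPLICATION_SLOT "electric_tmp_1" TEMPORARY LOGICAL pgoutput;
```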

Electric can clone the persistent slot (with its current LSN and all) into a temporary, randomly named slot every time it opens a new replication connection. This takes care of replication slot conflict and of the cleanup. The last problem remaining is how to ensure exclusive access to the persistent slot when it needs to be modified to advance its LSN forward.
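
For the cloning step, Postgres 12+ provides `pg_copy_logical_replication_slot`, which copies the source slot's current position into a new, optionally temporary slot. A hedged sketch with made-up names:

```sql
-- Clone the persistent slot (including its current position) into a
-- randomly named temporary slot for this replication connection. The third
-- argument marks the copy as temporary, so Postgres drops it automatically
-- when the session ends.
SELECT pg_copy_logical_replication_slot(
  'electric_persistent',   -- source slot
  'electric_tmp_4f2a9c',   -- per-connection name (illustrative)
  true                     -- temporary
);
```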

A different strategy for rolling deploys would be for the new Electric instance to start up everything except for the replication connection, trying the latter in a loop. Design its healthcheck endpoint in such a way that it reports OK in this state. The old instance will then go away and the new one will be able to take control of the replication slot.

The transition from the old instance to the new one can be further smoothed out by acquiring a session-level advisory lock tied to the replication slot. The idea is for Electric to hold the lock at all times (requiring one active PG connection) and to report itself as healthy even while it's waiting for the lock to be released. Thus, when the old instance shuts down, the new one immediately acquires the lock and opens a replication connection of its own.
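
A sketch of that advisory-lock handoff, assuming the lock key is derived from the slot name (the key scheme here is illustrative, not Electric's actual one):

```sql
-- Held for the lifetime of one dedicated session. A second instance running
-- the same statement simply blocks until the first session exits, at which
-- point Postgres releases the session-level lock automatically.
SELECT pg_advisory_lock(hashtext('electric_slot_default'));
```

Because the lock is session-scoped, a crashed instance releases it as soon as Postgres notices the dead connection, which also helps with the crashed-server case described in the issue.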

@KyleAMathews
Contributor Author

@alco all that assumes that the new instance will have access to all the old state of the previous Electric instance? In a stateless rolling deploy, the new instance would recreate all the shapes so wouldn't care about advancing the LSN, etc.

Another consideration too is that for horizontal scalability, we'd want to support multiple electric servers with their own slots.

@alco
Member

alco commented Sep 24, 2024

the new instance would recreate all the shapes so wouldn't care about advancing the LSN

That's the easy case. We obviously do not want to do that on every deploy. So yes, I'm assuming the new instance gets attached to the same persistent storage as well as connecting to the same DB.

Another consideration too is that for horizontal scalability, we'd want to support multiple electric servers with their own slots.

Yeah, this is also easier than doing the rolling deploy where the same persistent slot needs to be adopted by a new instance. In the truly parallel setup, each Electric instance's replication slot name could even be statically configured to avoid conflicts.

@KyleAMathews
Contributor Author

So yes, I'm assuming the new instance gets attached to the same persistent storage as well as connecting to the same DB.

This is actually probably going to be less common than a pure stateless handoff. It's tricky handing off the persistent storage, and we'd perhaps need to do more work, e.g. locking for writes so the two instances don't clobber each other. And is there any in-memory state we need to persist?

Also, it's more devops work to set up a persistent disk that instances share; I'm not sure most people will want to.

We'll want to support the handoff at some point as it'd be nice not to recreate very large shapes (recreating smaller shapes is fine as long as we don't do it constantly). But it doesn't seem like an immediate priority.

@KyleAMathews
Contributor Author

Not directly related to this issue, but a general note for the discussion: it'd also be nice to be able to migrate shape by shape, e.g. spin up a new server and migrate shapes over one by one until the old server has been drained and can be turned off.

This is good for deploys but also handy for rebalancing horizontally scaled servers.

@balegas
Contributor

balegas commented Sep 25, 2024

Should probably address server migration from the beginning with the ability to keep shape logs :).

@KyleAMathews changed the title from "Handling rolling deploys of new Electric servers" to "Handle replication slot conflicts for rolling deploys of new Electric servers and restarts of crashed servers" on Sep 25, 2024
@KyleAMathews
Contributor Author

Should probably address server migration from the beginning with the ability to keep shape logs :).

@balegas this should come later. We want the cheap, simple implementation that solves the problem, and we can optimize later. Right now, Electric isn't usable in production, so this is a critical blocker that needs to be fixed ASAP.

@msfstef self-assigned this Sep 25, 2024
@msfstef
Contributor

msfstef commented Sep 26, 2024

I've discussed with @balegas and I will start implementing the latter approach suggested by @alco:

  1. Electric boots up and checks whether the permanent replication slot exists (during this it's in a "starting" state).
    1a. If it does not exist, Electric creates it, becomes its consumer and grabs an advisory lock, marking itself as active.
    1b. If it does exist, Electric continuously tries to acquire the advisory lock, marking itself as ready.
  2. Orchestration checks whether any new Electric is ready, and when it is, shuts the old one down.
  3. The new Electric acquires the lock and is now active.
    3a. Even if multiple Electrics are booted simultaneously, the lock guarantees only one consumer for the replication slot (see the sketch after this list).
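
A hedged sketch of the Postgres calls behind steps 1-3, with illustrative slot and lock-key names:

```sql
-- Step 1: does the permanent slot already exist?
SELECT 1 FROM pg_replication_slots WHERE slot_name = 'electric_slot_default';

-- Step 1a: first boot, create the permanent slot.
SELECT pg_create_logical_replication_slot('electric_slot_default', 'pgoutput');

-- Steps 1b and 3: block until the current holder releases the advisory lock,
-- then become the sole consumer of the slot.
SELECT pg_advisory_lock(hashtext('electric_slot_default'));
```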

If the handoff shares persistent storage, the new Electric is able to resume all shapes from where they left off, since the replication stream picks up at the exact point where the previous instance stopped.

If the storage is not shared, the new Electric can still recreate all shapes using the same replication stream. The only potential downside is that, after a shutdown followed by a much later recovery, the replication slot could be far behind and need some time to catch up, but that is a case that should be detected and handled by orchestration anyway.

I think this is the simplest approach and guarantees smooth handovers for single Electric deploys, and offers a solid base for expanding on it in terms of orchestration.

@msfstef linked a pull request Oct 1, 2024 that will close this issue
msfstef added a commit that referenced this issue Oct 1, 2024
Addresses #1749

- Makes publication and slot names configurable via a
`REPLICATION_STREAM_ID` env variable, which can ultimately be used for
multiple electric deploys
- Quotes all publication and slot names to address potential issues with
configurable names (alternative is to force downcase them when
initialised to avoid nasty case-sensitive bugs)
- Waits for a message from `Electric.LockConnection` that the lock is
acquired before initialising `ConnectionManager` with the replication
stream and shapes.
- If more than one Electric tries to connect to the same replication
slot (with the same `REPLICATION_STREAM_ID`), it will make a blocking
query to acquire the lock that will resolve once the previous Electric
using that slot releases it - this addresses rolling deploys, and
ensures resources are initialised only once the previous Electric has
released them
- Could potentially switch to `pg_try_advisory_lock`, which is not a
blocking query but immediately returns whether the lock could be
acquired, and implement retries with backoff; but since using
`pg_advisory_lock` simplifies the implementation I decided to start with
that and see what people think (the two forms are sketched below).
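
For reference, a sketch of the two lock-acquisition styles, using an illustrative slot-derived key; the retry/backoff loop for the non-blocking form would live in application code:

```sql
-- Blocking form: the statement does not return until the lock is granted.
SELECT pg_advisory_lock(hashtext('electric_slot_default'));

-- Non-blocking form: returns true/false immediately, so the caller can
-- retry with backoff instead of holding a query open.
SELECT pg_try_advisory_lock(hashtext('electric_slot_default'));
```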
 

Things that I still need to address:
- Currently the publication gets altered when a shape is created (adds a
table and potentially a row filter) but no cleanup occurs - so the
publication can potentially grow to include everything between restarts
and deploys even if it is not being used.
- The way I want to address this is to change
`Electric.Postgres.Configuration` to alter the publication based on
_all_ active shapes rather than each individual one; in that case every
call will update the publication as necessary, and resuming/cleaning
becomes a matter of calling this every time a shape is deleted and once
upon starting (with recovered shapes or no shapes). Can be a separate PR.
- Created #1774 to
address this separately

---------

Co-authored-by: Oleksii Sholik <[email protected]>
KyleAMathews pushed a commit that referenced this issue Nov 1, 2024