
Handle replication slot conflicts for rolling deploys of new Electric servers and restarts of crashed servers #1749

Closed
KyleAMathews opened this issue Sep 24, 2024 · 8 comments · Fixed by #1762

@KyleAMathews
Contributor

Right now we have a fixed name for our replication slot in Postgres, which means you can't do a rolling deploy (i.e. start a new Electric server instance while the old one is running and then kill the old server once the new one is healthy).

To fix this, we need to use random slot names that won't conflict.

A related issue: we should also add heartbeats from Electric to Postgres. A common problem with hosting is that an instance crashes and a new instance is created to take its place. When an Electric instance crashes, it might not be able to clean up its replication slot, so the new instance should be able to see that the old Electric has stopped heartbeating and clean the slot up on its behalf.

@KyleAMathews added this to the Production Readiness milestone Sep 24, 2024
@alco
Member

alco commented Sep 24, 2024

To avoid the problem of leftover replication slots, we could use the strategy where a single persistent replication slot is created on first run but then another temporary slot is created for the replication connection. Once the connection closes, Postgres will remove the temporary slot automatically.
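
A minimal sketch of that idea at the Postgres level, assuming a `pgoutput`-based logical slot and illustrative slot names (not Electric's actual naming):

```sql
-- Created once on first run; this slot survives restarts and disconnects.
SELECT pg_create_logical_replication_slot('electric_persistent', 'pgoutput');

-- On each replication connection, a TEMPORARY slot is used instead and is
-- dropped automatically by Postgres when the connection closes. Over the
-- streaming replication protocol that would look like:
--   CREATE_REPLICATION_SLOT "electric_tmp_1" TEMPORARY LOGICAL pgoutput;
```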

Electric can clone the persistent slot (with its current LSN and all) into a temporary, randomly named slot every time it opens a new replication connection. This takes care of replication slot conflict and of the cleanup. The last problem remaining is how to ensure exclusive access to the persistent slot when it needs to be modified to advance its LSN forward.
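
For the cloning step, Postgres 12+ provides `pg_copy_logical_replication_slot`, which copies the source slot's current position into a new, optionally temporary slot. A hedged sketch with made-up names:

```sql
-- Clone the persistent slot (including its current position) into a
-- randomly named temporary slot for this replication connection. The third
-- argument marks the copy as temporary, so Postgres drops it automatically
-- when the session ends.
SELECT pg_copy_logical_replication_slot(
  'electric_persistent',   -- source slot
  'electric_tmp_4f2a9c',   -- per-connection name (illustrative)
  true                     -- temporary
);
```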

A different strategy for rolling deploys would be for the new Electric instance to start up everything except for the replication connection, trying the latter in a loop. Design its healthcheck endpoint in such a way that it reports OK in this state. The old instance will then go away and the new one will be able to take control of the replication slot.

The transition from the old instance to the new one can be further smoothed out by acquiring a session-level advisory lock tied to the replication slot. The idea is for Electric to hold the lock at all times (requiring one active PG connection) and to report itself as healthy even while it's waiting for the lock to be released. Thus, when the old instance shuts down, the new one immediately acquires the lock and opens a replication connection of its own.
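
A sketch of that advisory-lock handoff, assuming the lock key is derived from the slot name (the key scheme here is illustrative, not Electric's actual one):

```sql
-- Held for the lifetime of one dedicated session. A second instance running
-- the same statement simply blocks until the first session exits, at which
-- point Postgres releases the session-level lock automatically.
SELECT pg_advisory_lock(hashtext('electric_slot_default'));
```

Because the lock is session-scoped, a crashed instance releases it as soon as Postgres notices the dead connection, which also helps with the crashed-server case described in the issue.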

@KyleAMathews
Contributor Author

@alco all that assumes that the new instance will have access to all the old state of the previous Electric instance? In a stateless rolling deploy, the new instance would recreate all the shapes so wouldn't care about advancing the LSN, etc.

Another consideration too is that for horizontal scalability, we'd want to support multiple electric servers with their own slots.

@alco
Member

alco commented Sep 24, 2024

the new instance would recreate all the shapes so wouldn't care about advancing the LSN

That's the easy case. We obviously do not want to do that on every deploy. So yes, I'm assuming the new instance gets attached to the same persistent storage as well as connecting to the same DB.

Another consideration too is that for horizontal scalability, we'd want to support multiple electric servers with their own slots.

Yeah, this is also easier than doing the rolling deploy where the same persistent slot needs to be adopted by a new instance. In the truly parallel setup, each Electric instance's replication slot name could even be statically configured to avoid conflicts.

@KyleAMathews
Contributor Author

So yes, I'm assuming the new instance gets attached to the same persistent storage as well as connecting to the same DB.

This is actually probably going to be less common than a pure stateless handoff. It's tricky handing off the persistent storage, and we'd perhaps need to do more work, e.g. locking for writes so the two instances don't clobber each other. And is there any in-memory state we need to persist?

Also, it's more devops work to set up a persistent disk that instances share; I'm not sure most people will want to.

We'll want to support the handoff at some point as it'd be nice not to recreate very large shapes (recreating smaller shapes is fine as long as we don't do it constantly). But it doesn't seem like an immediate priority.

@KyleAMathews
Contributor Author

Not directly related to this issue, but a general note for the discussion: it'd also be nice to be able to migrate shape by shape, e.g. spin up a new server and migrate shapes over one by one until the old server has been drained and can be turned off.

This is good for deploys but also handy for rebalancing horizontally scaled servers.

@balegas
Contributor

balegas commented Sep 25, 2024

Should probably address server migration from the beginning with the ability to keep shape logs :).

@KyleAMathews changed the title from "Handling rolling deploys of new Electric servers" to "Handle replication slot conflicts for rolling deploys of new Electric servers and restarts of crashed servers" on Sep 25, 2024
@KyleAMathews
Contributor Author

Should probably address server migration from the beginning with the ability to keep shape logs :).

@balegas this should come later. We want the cheap, simple implementation that solves the problem, and we can optimize later. Right now, Electric isn't usable in production, so this is a critical blocker that needs to be fixed ASAP.

@msfstef self-assigned this Sep 25, 2024
@msfstef
Contributor

msfstef commented Sep 26, 2024

I've discussed with @balegas and I will start implementing the latter approach suggested by @alco:

  1. Electric boots up and checks whether the permanent replication slot exists (during this it's in a "starting" state).
    1a. If it does not exist, Electric creates it, becomes its consumer and grabs an advisory lock, marking itself as active.
    1b. If it does exist, Electric continuously tries to acquire the advisory lock, marking itself as ready.
  2. Orchestration checks whether any new Electric is ready, and when it is, shuts the old one down.
  3. The new Electric acquires the lock and is now active.
    3a. Even if multiple Electrics are booted simultaneously, the lock guarantees only one consumer for the replication slot (see the sketch after this list).
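
A hedged sketch of the Postgres calls behind steps 1-3, with illustrative slot and lock-key names:

```sql
-- Step 1: does the permanent slot already exist?
SELECT 1 FROM pg_replication_slots WHERE slot_name = 'electric_slot_default';

-- Step 1a: first boot, create the permanent slot.
SELECT pg_create_logical_replication_slot('electric_slot_default', 'pgoutput');

-- Steps 1b and 3: block until the current holder releases the advisory lock,
-- then become the sole consumer of the slot.
SELECT pg_advisory_lock(hashtext('electric_slot_default'));
```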

If the handoff shares persistent storage, the new Electric is able to resume all shapes from where they left off, since the replication stream picks up at the exact point where the previous instance stopped.

If the storage is not shared, the new Electric can still recreate all shapes using the same replication stream. The only potential downside is that, after a shutdown followed by a much later recovery, the replication slot could be far behind and need some time to catch up, but that is a case that should be detected and handled by orchestration anyway.

I think this is the simplest approach and guarantees smooth handovers for single Electric deploys, and offers a solid base for expanding on it in terms of orchestration.

@msfstef linked a pull request Oct 1, 2024 that will close this issue
msfstef added a commit that referenced this issue Oct 1, 2024
Addresses #1749

- Makes publication and slot names configurable via a
`REPLICATION_STREAM_ID` env variable, which can ultimately be used for
multiple electric deploys
- Quotes all publication and slot names to address potential issues with
configurable names (alternative is to force downcase them when
initialised to avoid nasty case-sensitive bugs)
- Waits for a message from `Electric.LockConnection` that the lock is
acquired before initialising `ConnectionManager` with the replication
stream and shapes.
- If more than one Electric tries to connect to the same replication
slot (with the same `REPLICATION_STREAM_ID`), it will make a blocking
query to acquire the lock that will resolve once the previous Electric
using that slot releases it - this addresses rolling deploys, and
ensures resources are initialised only once the previous Electric has
released them
- Could potentially switch to `pg_try_advisory_lock`, which is not a
blocking query but immediately returns whether the lock could be
acquired, and implement retries with backoff; but since using
`pg_advisory_lock` simplifies the implementation I decided to start with
that and see what people think (the two forms are sketched below).
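
For reference, a sketch of the two lock-acquisition styles, using an illustrative slot-derived key; the retry/backoff loop for the non-blocking form would live in application code:

```sql
-- Blocking form: the statement does not return until the lock is granted.
SELECT pg_advisory_lock(hashtext('electric_slot_default'));

-- Non-blocking form: returns true/false immediately, so the caller can
-- retry with backoff instead of holding a query open.
SELECT pg_try_advisory_lock(hashtext('electric_slot_default'));
```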
 

Things that I still need to address:
- Currently the publication gets altered when a shape is created (adds a
table and potentially a row filter) but no cleanup occurs - so the
publication can potentially grow to include everything between restarts
and deploys even if it is not being used.
- The way I want to address this is to change
`Electric.Postgres.Configuration` to alter the publication based on
_all_ active shapes rather than each individual one; in that case every
call will update the publication as necessary, and resuming/cleaning
becomes a matter of calling this every time a shape is deleted and once
upon starting (with recovered shapes or no shapes). Can be a separate PR.
- Created #1774 to
address this separately

---------

Co-authored-by: Oleksii Sholik <[email protected]>
KyleAMathews pushed a commit that referenced this issue Nov 1, 2024