
fix: Handle replication slot conflicts #1762

Merged
33 commits merged into main on Oct 1, 2024

Conversation

msfstef
Contributor

@msfstef msfstef commented Sep 26, 2024

Addresses #1749

  • Makes publication and slot names configurable via a REPLICATION_STREAM_ID env variable, which can ultimately be used for multiple electric deploys
  • Quotes all publication and slot names to address potential issues with configurable names (alternative is to force downcase them when initialised to avoid nasty case-sensitive bugs)
  • Waits for a message from Electric.LockConnection that the lock is acquired before initialising ConnectionManager with the replication stream and shapes.
    • If more than one Electric tries to connect to the same replication slot (with the same REPLICATION_STREAM_ID), it will make a blocking query to acquire the lock that resolves once the previous Electric using that slot releases it. This addresses rolling deploys and ensures resources are initialised only once the previous Electric has released them.
    • Could potentially switch to pg_try_advisory_lock, which is not a blocking query but immediately returns whether the lock could be acquired, and implement retries with backoff; but since pg_advisory_lock simplifies the implementation I decided to start with that and see what people think (see the sketch after this list).
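
A minimal sketch of the blocking-lock idea above, assuming a plain Postgrex connection; the module name, key derivation, and function shape are illustrative only, not the PR's actual Electric.LockConnection:

```elixir
defmodule LockSketch do
  @moduledoc false

  # `conn` is a Postgrex connection process; `stream_id` stands in for the
  # configured REPLICATION_STREAM_ID (both are assumptions for illustration).
  def acquire(conn, stream_id) do
    # pg_advisory_lock takes an integer key, so derive one from the stream id.
    key = :erlang.phash2("electric_slot_#{stream_id}")

    # Blocks until the previous holder releases the lock (or its session ends),
    # hence the infinite timeout.
    Postgrex.query!(conn, "SELECT pg_advisory_lock($1)", [key], timeout: :infinity)
    :ok
  end
end
```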

Things that I still need to address:

  • Currently the publication gets altered when a shape is created (adds a table and potentially a row filter) but no cleanup occurs - so the publication can potentially grow to include everything between restarts and deploys even if it is not being used.
    • The way I want to address this is to change the Electric.Postgres.Configuration to alter the publication based on all active shapes rather than on each individual one; then every call updates the publication as necessary, and resuming/cleaning becomes a matter of calling this every time a shape is deleted and once upon starting (with recovered shapes or no shapes). Can be a separate PR (rough sketch after this list).
    • Created Clean up publication filters when shapes are removed #1774 to address this separately
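
For the publication-cleanup follow-up, a rough sketch of the "derive the publication from all active shapes" idea; the module, function name, arguments, and quoting here are invented for illustration, and the real change is tracked in #1774:

```elixir
defmodule PublicationSketch do
  # Replace the publication's table list wholesale from the set of currently
  # active shapes, so tables belonging to deleted shapes fall out automatically.
  def sync_publication(conn, publication_name, active_tables) when active_tables != [] do
    tables =
      Enum.map_join(active_tables, ", ", fn {schema, table} -> ~s("#{schema}"."#{table}") end)

    Postgrex.query!(conn, ~s(ALTER PUBLICATION "#{publication_name}" SET TABLE #{tables}), [])
  end
end
```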

@msfstef msfstef requested review from alco and robacourt September 26, 2024 14:02
@KyleAMathews
Contributor

Does this handle the case where an Electric instance crashes and another is created to take its place? The problem there, right, is that the slot is taken but it'll never be removed because the old server crashed. How do we detect this and let the new server delete it?

Also what's the testing strategy for this?

@msfstef
Contributor Author

msfstef commented Sep 26, 2024

@KyleAMathews in the case of a crash and a subsequent recovery, the new Electric would start consuming the replication slot, so there shouldn't be a need to delete it (?). The only case where the slot should be deleted is when a cleanup needs to happen, which is either a controlled shutdown of Electric (without a new one replacing it) or a separate orchestration mechanism that needs to clean up replication slots.

As for testing, I plan to add some integration tests with lux to simulate handoffs and shutdown/recovery. At the moment, though, I'm facing an issue: if I try to acquire a lock with pg_advisory_lock, which results in a query that lasts until the lock is acquired, Postgrex blocks the connection pool process and I can't "query" it for its status in order to reply to health checks.

Someone with more Elixir experience should be able to help with this (@alco feel free to look into this tomorrow if you want). I started working on a solution with pg_try_advisory_lock, but I like the idea of Electric starting to consume the replication stream as soon as the lock is released rather than having retries with backoffs.
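
For reference, the retry-with-backoff alternative mentioned above might look roughly like this; the module and function names are invented and this is not code from the PR:

```elixir
defmodule TryLockSketch do
  # Polls pg_try_advisory_lock instead of holding a blocking query open,
  # backing off between attempts.
  def acquire_with_retry(conn, key, backoff_ms \\ 500) do
    case Postgrex.query!(conn, "SELECT pg_try_advisory_lock($1)", [key]) do
      %Postgrex.Result{rows: [[true]]} ->
        :ok

      %Postgrex.Result{rows: [[false]]} ->
        # Lock still held by the previous Electric: wait and retry, capping the delay.
        Process.sleep(backoff_ms)
        acquire_with_retry(conn, key, min(backoff_ms * 2, 5_000))
    end
  end
end
```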

@KyleAMathews
Contributor

> the new Electric would start consuming the replication slot so there shouldn't be a need to delete it

Oh ok, so if an instance dies, postgres knows the replication slot doesn't have a listener and lets the new instance grab it?


netlify bot commented Sep 30, 2024

Deploy Preview for electric-next ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 67dc96d |
| 🔍 Latest deploy log | https://app.netlify.com/sites/electric-next/deploys/66faa56a37d77e0008c0f971 |
| 😎 Deploy Preview | https://deploy-preview-1762--electric-next.netlify.app |

@msfstef
Contributor Author

msfstef commented Sep 30, 2024

> postgres knows the replication slot doesn't have a listener and lets the new instance grab it

We rely on the advisory lock to determine that, but essentially yeah, we leverage postgres locks to ensure we're the only ones consuming the replication slot - if we're not, we wait for the lock to be released and then become the sole consumer.
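
This also covers the crash case discussed above, because session-level advisory locks are released by Postgres when the holding session terminates; the snippet below is a generic illustration of that behaviour (with placeholder connection options and key), not code from this PR:

```elixir
# If the Electric holding the lock crashes, its connection (and therefore its
# session-level advisory lock) goes away, and this blocking call in the
# replacement instance returns.
{:ok, conn} = Postgrex.start_link(database: "electric")

Postgrex.query!(conn, "SELECT pg_advisory_lock($1)", [42], timeout: :infinity)
```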

Comment on lines 13 to 18
case get_service_status.() do
:starting -> "starting"
:ready -> "ready"
:active -> "active"
:stopping -> "stopping"
end
Member

Why not just to_string(get_service_status.())?

Contributor Author

I was trying to more explicitly decouple the API results of the health check endpoint from the internal representations of the system state - so we can change internal status but safely keep the API the same or vice versa.
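
A hypothetical before/after to make that concrete; the rename to :draining is invented for the example and is not something in the PR:

```elixir
# If an internal status atom is later renamed, only the left-hand side of the
# case changes and the health-check API keeps returning the same string.
get_service_status = fn -> :draining end  # imagine :stopping was renamed internally

status_string =
  case get_service_status.() do
    :starting -> "starting"
    :ready -> "ready"
    :active -> "active"
    :draining -> "stopping"  # API string is unchanged by the internal rename
  end
```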

Contributor Author

(added comment explaining this to both cases where the matching seems superfluous)

Comment on lines 12 to 16
case connection_status do
:waiting -> :waiting
:starting -> :starting
:active -> :active
end
Member

Why is this case needed here?

Contributor Author

The idea was that the ServiceStatus package would collect information from more than just the connection manager to determine the status (e.g. storage services, anything else that's long running), and we'd combine the various service states into a final ServiceStatus.status() type.

In this case the connection status and service status are one and the same, so it looks a bit odd, but I felt it makes sense to keep the mapping explicit.
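
A hypothetical illustration of that intent, with the extra storage input and the combination rules invented for the example (not the PR's actual ServiceStatus code):

```elixir
defmodule ServiceStatusSketch do
  # Combine several sub-statuses into one overall status; the storage status
  # and the mapping below are placeholders.
  def status(connection_status, storage_status) do
    case {connection_status, storage_status} do
      {:active, :ready} -> :active
      {:waiting, _} -> :waiting
      {_, _} -> :starting
    end
  end
end
```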

packages/sync-service/lib/electric/lock_connection.ex: 3 resolved review threads (outdated)
@msfstef msfstef marked this pull request as ready for review September 30, 2024 16:30
@msfstef
Contributor Author

msfstef commented Sep 30, 2024

Added an integration test for rolling deploys as well - should be ready for a full review again now @alco @robacourt

Contributor

@robacourt robacourt left a comment

Great work! As discussed, I would really like a warning when it can't acquire the lock, to aid with debugging the situation.

@msfstef msfstef requested a review from robacourt October 1, 2024 09:35
@msfstef
Contributor Author

msfstef commented Oct 1, 2024

@robacourt warning added, as well as a crash recovery integration test for good measure (to capture both rolling deploys and crash recovery scenarios)
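
The warning's exact wording isn't shown in this thread; something along these lines would be the general shape (the message text and the stream_id placeholder are assumptions, not the PR's code):

```elixir
require Logger

stream_id = "default"  # placeholder for the configured REPLICATION_STREAM_ID

Logger.warning(
  "Lock for replication stream #{stream_id} is held by another Electric instance; " <>
    "waiting for it to be released before starting replication"
)
```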

Contributor

@robacourt robacourt left a comment

Great work!

alco added 2 commits October 1, 2024 17:36

- Matching on keywords in Elixir is positional, so if the order of keys at the call site changes at any point, it would break this start_link() implementation.
- The "with" expression is doing nothing here. Better to add control flow when it's needed rather than "just in case".
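
To illustrate the first point, a sketch with hypothetical module and option names (not the PR's start_link/1):

```elixir
defmodule Positional do
  # Matching a keyword list in the function head is positional: this clause
  # matches start_link(name: :foo, stream_id: "default") but NOT
  # start_link(stream_id: "default", name: :foo).
  def start_link([name: name, stream_id: stream_id]), do: {:ok, {name, stream_id}}
end

defmodule OrderIndependent do
  # Keyword.fetch!/2 does not care about the order of keys at the call site.
  def start_link(opts) do
    name = Keyword.fetch!(opts, :name)
    stream_id = Keyword.fetch!(opts, :stream_id)
    {:ok, {name, stream_id}}
  end
end
```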
Member

@alco alco left a comment

Stellar work! 🥇

.changeset/poor-candles-fly.md: resolved review thread (outdated)
integration-tests/tests/crash-recovery.lux: resolved review thread (outdated)
@msfstef msfstef merged commit 5f6d202 into main Oct 1, 2024
23 checks passed
@msfstef msfstef deleted the msfstef/handle-replication-slot-conflicts branch October 1, 2024 15:52
KyleAMathews pushed a commit that referenced this pull request Nov 1, 2024