
fix: Handle replication slot conflicts #1762

Merged
33 commits merged into main on Oct 1, 2024

Conversation

msfstef
Contributor

@msfstef msfstef commented Sep 26, 2024

Addresses #1749

  • Makes publication and slot names configurable via a REPLICATION_STREAM_ID env variable, which can ultimately be used for multiple electric deploys
  • Quotes all publication and slot names to address potential issues with configurable names (alternative is to force downcase them when initialised to avoid nasty case-sensitive bugs)
  • Waits for a message from Electric.LockConnection that the lock is acquired before initialising ConnectionManager with the replication stream and shapes.
    • If more than one Electric tries to connect to the same replication slot (with the same REPLICATION_STREAM_ID), it will make a blocking query to acquire the lock that resolves once the previous Electric using that slot releases it. This addresses rolling deploys and ensures resources are initialised only once the previous Electric has released them.
    • Could potentially switch to pg_try_advisory_lock, which is not a blocking query but immediately returns whether the lock could be acquired, and implement retries with backoff; but since pg_advisory_lock simplifies the implementation I decided to start with that and see what people think (see the sketch after this list).
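
A minimal sketch of the blocking-lock idea above, assuming a plain Postgrex connection; the module name, key derivation, and function shape are illustrative only, not the PR's actual Electric.LockConnection:

```elixir
defmodule LockSketch do
  @moduledoc false

  # `conn` is a Postgrex connection process; `stream_id` stands in for the
  # configured REPLICATION_STREAM_ID (both are assumptions for illustration).
  def acquire(conn, stream_id) do
    # pg_advisory_lock takes an integer key, so derive one from the stream id.
    key = :erlang.phash2("electric_slot_#{stream_id}")

    # Blocks until the previous holder releases the lock (or its session ends),
    # hence the infinite timeout.
    Postgrex.query!(conn, "SELECT pg_advisory_lock($1)", [key], timeout: :infinity)
    :ok
  end
end
```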

Things that I still need to address:

  • Currently the publication gets altered when a shape is created (adds a table and potentially a row filter) but no cleanup occurs - so the publication can potentially grow to include everything between restarts and deploys even if it is not being used.
    • The way I want to address this is to change the Electric.Postgres.Configuration to alter the publication based on all active shapes rather than on each individual one; then every call updates the publication as necessary, and resuming/cleaning becomes a matter of calling this every time a shape is deleted and once upon starting (with recovered shapes or no shapes). Can be a separate PR (rough sketch after this list).
    • Created Clean up publication filters when shapes are removed #1774 to address this separately
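
For the publication-cleanup follow-up, a rough sketch of the "derive the publication from all active shapes" idea; the module, function name, arguments, and quoting here are invented for illustration, and the real change is tracked in #1774:

```elixir
defmodule PublicationSketch do
  # Replace the publication's table list wholesale from the set of currently
  # active shapes, so tables belonging to deleted shapes fall out automatically.
  def sync_publication(conn, publication_name, active_tables) when active_tables != [] do
    tables =
      Enum.map_join(active_tables, ", ", fn {schema, table} -> ~s("#{schema}"."#{table}") end)

    Postgrex.query!(conn, ~s(ALTER PUBLICATION "#{publication_name}" SET TABLE #{tables}), [])
  end
end
```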

@msfstef msfstef requested review from alco and robacourt September 26, 2024 14:02
@KyleAMathews
Contributor

Does this handle the case where an Electric instance crashes and another is created to take its place? The problem there, right, is that the slot is taken but it'll never be removed because the old server crashed. How do we detect this and let the new server delete it?

Also what's the testing strategy for this?

@msfstef
Contributor Author

msfstef commented Sep 26, 2024

@KyleAMathews in the case of a crash and a subsequent recovery, the new Electric would start consuming the replication slot, so there shouldn't be a need to delete it (?). The only case where the slot should be deleted is when a cleanup needs to happen, which is either a controlled shutdown of Electric (without a new one replacing it) or a separate orchestration mechanism that needs to clean up replication slots.

As for testing, I plan to add some integration tests with lux to simulate handoffs and shutdown/recovery. At the moment, though, I'm facing an issue: if I try to acquire a lock with pg_advisory_lock, which results in a query that lasts until the lock is acquired, Postgrex blocks the connection pool process and I can't "query" it for its status in order to reply to health checks.

Someone with more Elixir experience should be able to help with this (@alco feel free to look into this tomorrow if you want). I started working on a solution with pg_try_advisory_lock, but I like the idea of Electric starting to consume the replication stream as soon as the lock is released rather than having retries with backoffs.
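
For reference, the retry-with-backoff alternative mentioned above might look roughly like this; the module and function names are invented and this is not code from the PR:

```elixir
defmodule TryLockSketch do
  # Polls pg_try_advisory_lock instead of holding a blocking query open,
  # backing off between attempts.
  def acquire_with_retry(conn, key, backoff_ms \\ 500) do
    case Postgrex.query!(conn, "SELECT pg_try_advisory_lock($1)", [key]) do
      %Postgrex.Result{rows: [[true]]} ->
        :ok

      %Postgrex.Result{rows: [[false]]} ->
        # Lock still held by the previous Electric: wait and retry, capping the delay.
        Process.sleep(backoff_ms)
        acquire_with_retry(conn, key, min(backoff_ms * 2, 5_000))
    end
  end
end
```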

@KyleAMathews
Contributor

> the new Electric would start consuming the replication slot so there shouldn't be a need to delete it

Oh ok, so if an instance dies, postgres knows the replication slot doesn't have a listener and lets the new instance grab it?


netlify bot commented Sep 30, 2024

Deploy Preview for electric-next ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 67dc96d |
| 🔍 Latest deploy log | https://app.netlify.com/sites/electric-next/deploys/66faa56a37d77e0008c0f971 |
| 😎 Deploy Preview | https://deploy-preview-1762--electric-next.netlify.app |

@msfstef
Contributor Author

msfstef commented Sep 30, 2024

> postgres knows the replication slot doesn't have a listener and lets the new instance grab it

We rely on the advisory lock to determine that, but essentially yeah, we leverage postgres locks to ensure we're the only ones consuming the replication slot - if we're not, we wait for the lock to be released and then become the sole consumer.
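
This also covers the crash case discussed above, because session-level advisory locks are released by Postgres when the holding session terminates; the snippet below is a generic illustration of that behaviour (with placeholder connection options and key), not code from this PR:

```elixir
# If the Electric holding the lock crashes, its connection (and therefore its
# session-level advisory lock) goes away, and this blocking call in the
# replacement instance returns.
{:ok, conn} = Postgrex.start_link(database: "electric")

Postgrex.query!(conn, "SELECT pg_advisory_lock($1)", [42], timeout: :infinity)
```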

Comment on lines 13 to 18
case get_service_status.() do
:starting -> "starting"
:ready -> "ready"
:active -> "active"
:stopping -> "stopping"
end
Member

Why not just to_string(get_service_status.())?

Contributor Author

I was trying to more explicitly decouple the API results of the health check endpoint from the internal representations of the system state - so we can change internal status but safely keep the API the same or vice versa.
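
A hypothetical before/after to make that concrete; the rename to :draining is invented for the example and is not something in the PR:

```elixir
# If an internal status atom is later renamed, only the left-hand side of the
# case changes and the health-check API keeps returning the same string.
get_service_status = fn -> :draining end  # imagine :stopping was renamed internally

status_string =
  case get_service_status.() do
    :starting -> "starting"
    :ready -> "ready"
    :active -> "active"
    :draining -> "stopping"  # API string is unchanged by the internal rename
  end
```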

Contributor Author

(added comment explaining this to both cases where the matching seems superfluous)

Comment on lines 12 to 16
case connection_status do
:waiting -> :waiting
:starting -> :starting
:active -> :active
end
Member

Why is this case needed here?

Contributor Author

The idea was that the ServiceStatus package would collect information from more than just the connection manager to determine the status (e.g. storage services, anything else that's long running), and we'd combine the various service states into a final ServiceStatus.status() type.

In this case the connection status and service status are one and the same, so it looks a bit odd, but I felt it makes sense to keep the mapping explicit.
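
A hypothetical illustration of that intent, with the extra storage input and the combination rules invented for the example (not the PR's actual ServiceStatus code):

```elixir
defmodule ServiceStatusSketch do
  # Combine several sub-statuses into one overall status; the storage status
  # and the mapping below are placeholders.
  def status(connection_status, storage_status) do
    case {connection_status, storage_status} do
      {:active, :ready} -> :active
      {:waiting, _} -> :waiting
      {_, _} -> :starting
    end
  end
end
```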

packages/sync-service/lib/electric/lock_connection.ex: 3 resolved review threads (outdated)
@msfstef msfstef marked this pull request as ready for review September 30, 2024 16:30
@msfstef
Contributor Author

msfstef commented Sep 30, 2024

Added an integration test for rolling deploys as well - should be ready for a full review again now @alco @robacourt

Contributor

@robacourt robacourt left a comment

Great work! As discussed, I would really like a warning when it can't acquire the lock, to aid with debugging the situation.

@msfstef msfstef requested a review from robacourt October 1, 2024 09:35
@msfstef
Contributor Author

msfstef commented Oct 1, 2024

@robacourt warning added, as well as a crash recovery integration test for good measure (to capture both rolling deploys and crash recovery scenarios)
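
The warning's exact wording isn't shown in this thread; something along these lines would be the general shape (the message text and the stream_id placeholder are assumptions, not the PR's code):

```elixir
require Logger

stream_id = "default"  # placeholder for the configured REPLICATION_STREAM_ID

Logger.warning(
  "Lock for replication stream #{stream_id} is held by another Electric instance; " <>
    "waiting for it to be released before starting replication"
)
```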

Contributor

@robacourt robacourt left a comment

Great work!

alco added 2 commits October 1, 2024 17:36

- Matching on keywords in Elixir is positional, so if the order of keys at the call site changes at any point, it would break this start_link() implementation.
- The "with" expression is doing nothing here. Better to add control flow when it's needed rather than "just in case".
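
To illustrate the first point, a sketch with hypothetical module and option names (not the PR's start_link/1):

```elixir
defmodule Positional do
  # Matching a keyword list in the function head is positional: this clause
  # matches start_link(name: :foo, stream_id: "default") but NOT
  # start_link(stream_id: "default", name: :foo).
  def start_link([name: name, stream_id: stream_id]), do: {:ok, {name, stream_id}}
end

defmodule OrderIndependent do
  # Keyword.fetch!/2 does not care about the order of keys at the call site.
  def start_link(opts) do
    name = Keyword.fetch!(opts, :name)
    stream_id = Keyword.fetch!(opts, :stream_id)
    {:ok, {name, stream_id}}
  end
end
```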
Member

@alco alco left a comment

Stellar work! 🥇

.changeset/poor-candles-fly.md: resolved review thread (outdated)
integration-tests/tests/crash-recovery.lux: resolved review thread (outdated)
@msfstef msfstef merged commit 5f6d202 into main Oct 1, 2024
23 checks passed
@msfstef msfstef deleted the msfstef/handle-replication-slot-conflicts branch October 1, 2024 15:52
KyleAMathews pushed a commit that referenced this pull request Nov 1, 2024