fix: Handle replication slot conflicts #1762
Conversation
Does this handle the case where an Electric instance crashes and another is created to take its place? The problem there, right, is that the slot is filled but it'll never be removed because the old server crashed. How shall we detect this and let the new server delete it? Also, what's the testing strategy for this?
@KyleAMathews in the case of a crash and a subsequent recovery, the new Electric would start consuming the replication slot, so there shouldn't be a need to delete it (?) - the only case where the slot should be deleted is when a cleanup needs to happen, which is either a controlled shutdown of Electric (without a new one replacing it) or a separate orchestration mechanism that needs to clean up replication slots. As for testing, I plan to add some integration tests with lux to simulate handoffs and shutdown and recovery, but at the moment I am facing an issue when I try to acquire a lock with … Someone with more Elixir experience should be able to help with this (@alco feel free to look into this tomorrow if you want); I started working on a solution with …
Oh ok, so if an instance dies, Postgres knows the replication slot doesn't have a listener, so it lets the new instance grab it?
We rely on the advisory lock to determine that, but essentially yeah, we leverage Postgres locks to ensure we're the only ones consuming the replication slot - if we're not, we wait for the lock to be released and then become the sole consumer.
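To make that concrete, here is a minimal sketch of what a blocking advisory-lock acquisition looks like from Elixir via Postgrex. The connection options and the lock-key derivation are illustrative assumptions, not the actual `Electric.LockConnection` code.

```elixir
# Illustrative sketch only - connection options and key derivation are assumptions.
{:ok, conn} =
  Postgrex.start_link(hostname: "localhost", database: "electric", username: "postgres")

# Derive a stable integer key from the replication stream id (hypothetical scheme).
lock_key = :erlang.phash2("default_replication_stream")

# pg_advisory_lock blocks until any previous holder (e.g. the old Electric instance)
# releases the lock or its session ends, then returns.
Postgrex.query!(conn, "SELECT pg_advisory_lock($1)", [lock_key], timeout: :infinity)

# From this point on we are the sole consumer of the replication slot.
```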
case get_service_status.() do
  :starting -> "starting"
  :ready -> "ready"
  :active -> "active"
  :stopping -> "stopping"
end
Why not just `to_string(get_service_status.())`?
I was trying to decouple the API results of the health check endpoint from the internal representations of the system state more explicitly - so we can change the internal status but safely keep the API the same, or vice versa.
(added a comment explaining this to both cases where the matching seems superfluous)
case connection_status do
  :waiting -> :waiting
  :starting -> :starting
  :active -> :active
end
Why is this `case` needed here?
The idea was that the `ServiceStatus` package would collect information from more than just the connection manager to determine the status (e.g. storage services, anything else that's long-running), and we'd combine the various service states into a final `ServiceStatus.status()` type.
In this case the connection status and service status are one and the same, so it looks a bit odd, but I felt it makes sense to keep the mapping more explicit.
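As a rough illustration of that idea, combining per-service states into one overall status could look like the hypothetical sketch below; the module name, status set, and precedence rules are assumptions rather than the actual `ServiceStatus` code.

```elixir
defmodule ServiceStatusSketch do
  @type status :: :waiting | :starting | :active | :stopping

  # Pick the "most blocking" state reported by any of the long-running services.
  @spec combine([status()]) :: status()
  def combine(statuses) do
    cond do
      :stopping in statuses -> :stopping
      :waiting in statuses -> :waiting
      :starting in statuses -> :starting
      true -> :active
    end
  end
end

ServiceStatusSketch.combine([:active, :starting])
#=> :starting
```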
Added an integration test for rolling deploys as well - should be ready to review fully again now @alco @robacourt
Great work! As discussed, I would really like a warning when it can't acquire the lock, to aid with debugging the situation.
@robacourt warning added, as well as a crash-recovery integration test for good measure (to capture both rolling-deploy and crash-recovery scenarios).
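For reference, the warning could be as simple as a log line emitted while the lock query is still outstanding; the wording below is a hypothetical example, not the message this PR actually logs.

```elixir
require Logger

# Hypothetical wording - emitted while waiting on the advisory lock.
Logger.warning(
  "Waiting to acquire advisory lock for the replication stream; " <>
    "another Electric instance may still be holding it."
)
```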
Great work!
Matching on keywords in Elixir is positional, so if the order of keys at the call site changes at any point, it would break this start_link() implementation.
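To illustrate the point with a standalone sketch (not the actual `start_link/1` from this PR): matching a keyword list in a function head is positional, so reordering the keys at the call site raises a `FunctionClauseError`.

```elixir
defmodule KeywordMatchDemo do
  # Matches only when the keys arrive in exactly this order.
  def start_link([name: name, stack_id: stack_id]), do: {:ok, {name, stack_id}}
end

KeywordMatchDemo.start_link(name: :repl, stack_id: "s1")
#=> {:ok, {:repl, "s1"}}

KeywordMatchDemo.start_link(stack_id: "s1", name: :repl)
#=> ** (FunctionClauseError) no function clause matching in KeywordMatchDemo.start_link/1
```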
The "with" expression is doing nothing here. Better add control flow when it's needed rather than "just in case".
Stellar work! 🥇
Addresses #1749

- Makes publication and slot names configurable via a `REPLICATION_STREAM_ID` env variable, which can ultimately be used for multiple Electric deploys
- Quotes all publication and slot names to address potential issues with configurable names (the alternative is to force-downcase them when initialised, to avoid nasty case-sensitivity bugs)
- Waits for a message from `Electric.LockConnection` that the lock is acquired before initialising `ConnectionManager` with the replication stream and shapes.
- If more than one Electric tries to connect to the same replication slot (with the same `REPLICATION_STREAM_ID`), it will make a blocking query to acquire the lock that resolves once the previous Electric using that slot releases it - this addresses rolling deploys and ensures resources are initialised only once the previous Electric has released them
- Could potentially switch to `pg_try_advisory_lock`, which is not a blocking query but immediately returns whether the lock could be acquired, and implement retries with backoff; but since `pg_advisory_lock` simplifies the implementation, I decided to start with that and see what people think (a rough sketch of that alternative follows below).

Things that I still need to address:

- Currently the publication gets altered when a shape is created (adds a table and potentially a row filter) but no cleanup occurs - so the publication can potentially grow to include everything between restarts and deploys even if it is not being used.
  - The way I want to address this is to change `Electric.Postgres.Configuration` to alter the publication based on _all_ active shapes rather than on each individual one; then every call will update the publication as necessary, and resuming/cleaning can be a matter of calling this every time a shape is deleted and once upon starting (with recovered shapes or no shapes). Can be a separate PR.
  - Created #1774 to address this separately

---------

Co-authored-by: Oleksii Sholik <[email protected]>
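For comparison, the non-blocking alternative mentioned above could look roughly like this sketch: poll `pg_try_advisory_lock` and back off between attempts. The module name, connection handling, and retry parameters are assumptions, not a proposed implementation.

```elixir
defmodule TryLockSketch do
  # Poll pg_try_advisory_lock until it returns true, doubling the wait each time (capped).
  # `conn` is an already-started Postgrex connection; `lock_key` is a bigint-compatible key.
  def acquire(conn, lock_key, backoff_ms \\ 200) do
    case Postgrex.query!(conn, "SELECT pg_try_advisory_lock($1)", [lock_key]) do
      %Postgrex.Result{rows: [[true]]} ->
        :ok

      %Postgrex.Result{rows: [[false]]} ->
        # Previous Electric still holds the lock; wait, then retry with backoff.
        Process.sleep(backoff_ms)
        acquire(conn, lock_key, min(backoff_ms * 2, 5_000))
    end
  end
end
```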