
Clean up publication filters when shapes are removed #1774

Closed · msfstef opened this issue Oct 1, 2024 · 7 comments · Fixed by #2154

msfstef (Contributor) commented Oct 1, 2024

Currently the replication publication gets altered when a shape is created (adding a table and potentially a row filter), but no cleanup occurs - so the publication can grow to include everything, even across restarts and deploys, and slow the system down over time.

My suggestion for addressing this is to change `Electric.Postgres.Configuration` to alter the publication based on all active shapes rather than on each individual one. That way every call updates the entire publication and its filters as necessary, and resuming/cleaning becomes a matter of calling it every time a shape is deleted and once on startup (with recovered shapes or no shapes).
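A rough sketch of this "rewrite from all active shapes" approach (illustrative Python, not the actual Elixir code; table and publication names are hypothetical). Since Postgres 15, a publication can carry a row filter per table, and `ALTER PUBLICATION ... SET TABLE` replaces the entire table list in one statement, which is what makes a full rewrite cheap to express:

```python
def publication_set_sql(publication, shapes):
    """Build one ALTER PUBLICATION statement from ALL active shapes.

    shapes: list of (table, where_clause_or_None). Multiple shapes on the
    same table are OR-ed together; a shape with no WHERE clause needs the
    whole table, so any row filter for that table is dropped.
    """
    filters = {}  # table -> set of where clauses, or None for "whole table"
    for table, where in shapes:
        if where is None or filters.get(table, set()) is None:
            filters[table] = None
        else:
            filters.setdefault(table, set()).add(where)

    parts = []
    for table in sorted(filters):
        clauses = filters[table]
        if clauses is None:
            parts.append(f'"{table}"')
        else:
            combined = " OR ".join(f"({c})" for c in sorted(clauses))
            parts.append(f'"{table}" WHERE ({combined})')

    if not parts:
        # No active shapes left: caller would instead drop all tables
        # from the publication.
        return None
    return f'ALTER PUBLICATION "{publication}" SET TABLE ' + ", ".join(parts)


# Example: two filtered shapes on "items" plus one whole-table shape
# on "users" collapse into a single statement.
print(publication_set_sql("pub", [("items", "id = 1"),
                                  ("items", "id = 2"),
                                  ("users", None)]))
```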

KyleAMathews (Contributor) commented:
> can be a matter of calling this every time a shape is deleted and once upon starting (with recovered shapes or no shapes).

How much work is it to generate this call? What if 1000s of new shapes are being created every second?

msfstef (Contributor, Author) commented Oct 1, 2024

The cleanups in general can be scheduled, but it's important to note that not cleaning up also means we're processing data through the replication stream that we don't need, which also takes up resources. Delaying a few requests in order to free up resources is probably still the better option, but we'd need to benchmark that. Cleaning it up at any point is probably the first step.

We could also avoid calling it whenever a shape is deleted and only call it whenever a shape is created, like we do now; the creation call would then clean up what is not used (and perhaps trigger periodic cleanups as well, rather than cleaning on every shape deletion).

For reference, we're already doing at least two roundtrips to Postgres whenever a shape is created in order to adjust the publication filter, and that happens on the critical path, so I doubt the cleanup will be a bottleneck.

msfstef added a commit that referenced this issue Oct 1, 2024
Addresses #1749

- Makes publication and slot names configurable via a
`REPLICATION_STREAM_ID` env variable, which can ultimately be used for
multiple electric deploys
- Quotes all publication and slot names to address potential issues with
configurable names (alternative is to force downcase them when
initialised to avoid nasty case-sensitive bugs)
- Waits for a message from `Electric.LockConnection` that the lock is
acquired before initialising `ConnectionManager` with the replication
stream and shapes.
- If more than one Electric tries to connect to the same replication
slot (with the same `REPLICATION_STREAM_ID`), it will make a blocking
query to acquire the lock that will resolve once the previous Electric
using that slot releases it - this addresses rolling deploys, and
ensures resources are initialised only once the previous Electric has
released them
- Could potentially switch to `pg_try_advisory_lock`, which is not a
blocking query but immediately returns whether the lock could be
acquired, and implement retries with backoff - but since using
`pg_advisory_lock` simplifies the implementation I decided to start with
that and see what people think.
 

Things that I still need to address:
- Currently the publication gets altered when a shape is created (adds a
table and potentially a row filter) but no cleanup occurs - so the
publication can potentially grow to include everything between restarts
and deploys even if it is not being used.
- The way I want to address this is to change the
`Electric.Postgres.Configuration` to alter the publication based on
_all_ active shapes rather than based on each individual one, in that
case every call will update the publication as necessary and
resuming/cleaning can be a matter of calling this every time a shape is
deleted and once upon starting (with recovered shapes or no shapes). Can
be a separate PR.
- Created #1774 to
address this separately

---------

Co-authored-by: Oleksii Sholik <[email protected]>
balegas (Contributor) commented Oct 1, 2024

It's a bit orthogonal to this issue, but we need to re-think how we set up filtering on the replication slot.

We cannot set a replication filter per shape; we need to be more selective with the WHERE clauses. Passing filters like `WHERE id = x`, `WHERE id = y` (filtering by id) will never work as the number of shapes grows.

Thinking at a macro level, one Electric instance needs to be able to handle a certain number of rows, no matter how many filters you put into the replication slot. If we get to the point where the maximum throughput of Electric is not high enough, we need to do shape matching more efficiently.

  • I'd like to evaluate the performance gains of just filtering out columns that we don't need, which would be a very simple filter to maintain.
  • There might be other easy filters we could apply.

I'll create an issue for evaluating the performance impact of filtering.

KyleAMathews (Contributor) commented:
Columns + tables seems sustainable.

balegas (Contributor) commented Oct 1, 2024

Yeah, we already subscribe to tables, but we should also drop them from the replication slot when the last shape for that table is gone. Columns + tables will not change so frequently.

@msfstef is there a quick route for implementing "last shape for table is gone"? Otherwise, I think this can be left until we work on #1778.

msfstef (Contributor, Author) commented Oct 2, 2024

@balegas it would require a bit of rewiring but I think we could do the following:

  1. Boot up Electric and recover any persisted shapes
  2. Build and maintain a reference-counting index for tables (and in the future perhaps columns as well)
    2a. We can also rebuild this from the list of shapes every time; it shouldn't be too expensive even with 100k shapes
  3. On every shape addition/removal, update the index and add/remove tables and filters on the publication
    3a. This means we only update the publication if tables and/or columns are added or removed, so most shapes (ones selecting the same columns and operating on the same table with different filters) would not affect the publication

Should be a relatively simple solution, and we can leverage an ETS table to avoid going through a single process for this.
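The reference-counting idea in steps 2-3 can be sketched as follows (an illustrative Python sketch, simplified to tables only; the real implementation would be Elixir, e.g. backed by an ETS table as suggested). The key property is that the publication only needs touching when a table's count crosses zero in either direction:

```python
class RelationIndex:
    """Reference-count the tables used by active shapes (step 2 above).

    add_shape/remove_shape return True only when the publication must be
    updated: when a table is used for the first time, or when its last
    shape is gone (step 3a: additional shapes on an already-published
    table don't touch the publication).
    """

    def __init__(self):
        self.counts = {}  # table name -> number of shapes using it

    def add_shape(self, table):
        self.counts[table] = self.counts.get(table, 0) + 1
        return self.counts[table] == 1  # first shape on this table

    def remove_shape(self, table):
        self.counts[table] -= 1
        if self.counts[table] == 0:
            del self.counts[table]
            return True  # last shape on this table is gone
        return False
```

Usage: on boot, replay all recovered shapes through `add_shape` to rebuild the index (step 2a), then alter the publication only when a call returns `True`.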

KyleAMathews pushed a commit that referenced this issue Nov 1, 2024
balegas (Contributor) commented Nov 24, 2024

I think that adding fine-grained filtering to the replication slot remains a bit of a risk (it seems we're abusing it). We should evaluate how adding more filters to the replication slot increases lag/compute cost.

q: do we only filter the columns that shapes subscribe to?
UPDATE: found #1831

msfstef added a commit that referenced this issue Dec 17, 2024
Closes #1774

This work started out introducing column filters (see
#1831) but hit a
roadblock because we use `REPLICA IDENTITY FULL` - however, the
work also takes care of cleaning up filters.

- Introduced a single process for updating the publication - we were
locking on it before anyway, so we might as well linearise it ourselves.
- The process maintains a reference-counted structure of the filters per
relation, including where clauses and filtered columns, in order to
produce correct overall filters per relation
- Updates to the publication are debounced to allow batching many shape
creations together
- Every update does a complete rewrite of the publication filters so
they are kept clean - also introduced a `remove_shape` call so that if
Electric remains with no shapes it also has no subscriptions to tables.


## TODOs
- [x] Write tests for `PublicationManager`
- [x] Write procedure for recovering in-memory state from
`shape_status.list_shapes` in `recover_shapes`
- [ ] Split where clauses at top-level `AND`s to improve filter
optimality (suggested by @icehaunter) [edit: not doing this now, as
we can be smart about this and do even more "merging" of where clauses
like `x = 1` and `x = 2` into `x IN (1, 2)` - separate PR]
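The where-clause "merging" mentioned in the last TODO could look roughly like this (an illustrative Python sketch over clauses as plain strings; the real clauses are parsed SQL, so a production version would work on the AST, not regexes):

```python
import re

def merge_equality_clauses(clauses):
    """Merge simple `col = literal` clauses on the same column into one
    `col IN (...)` clause; anything that isn't a plain equality is kept
    verbatim (the caller ORs the resulting list together)."""
    eq = re.compile(r"^\s*(\w+)\s*=\s*(\S+)\s*$")
    by_col, rest = {}, []
    for clause in clauses:
        m = eq.match(clause)
        if m:
            by_col.setdefault(m.group(1), []).append(m.group(2))
        else:
            rest.append(clause)  # e.g. range predicates stay as-is
    merged = [
        f"{col} = {vals[0]}" if len(vals) == 1
        else f"{col} IN ({', '.join(vals)})"
        for col, vals in by_col.items()
    ]
    return merged + rest

# Two equalities on x collapse; y and the range predicate are untouched.
print(merge_equality_clauses(["x = 1", "x = 2", "y = 3", "z > 5"]))
```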