
Clean up publication filters when shapes are removed #1774

Closed · msfstef opened this issue Oct 1, 2024 · 7 comments · Fixed by #2154

msfstef (Contributor) commented Oct 1, 2024

Currently the replication publication gets altered when a shape is created (adding a table and potentially a row filter), but no cleanup occurs - so the publication can grow to include everything, even across restarts and deploys, and slow the system down over time.

My suggestion for addressing this is to change `Electric.Postgres.Configuration` to alter the publication based on all active shapes rather than on each individual one. That way every call updates the entire publication and its filters as necessary, and resuming/cleaning becomes a matter of calling it every time a shape is deleted and once on startup (with recovered shapes or no shapes).
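A rough sketch of this "rewrite from all active shapes" approach (illustrative Python, not the actual Elixir code; table and publication names are hypothetical). Since Postgres 15, a publication can carry a row filter per table, and `ALTER PUBLICATION ... SET TABLE` replaces the entire table list in one statement, which is what makes a full rewrite cheap to express:

```python
def publication_set_sql(publication, shapes):
    """Build one ALTER PUBLICATION statement from ALL active shapes.

    shapes: list of (table, where_clause_or_None). Multiple shapes on the
    same table are OR-ed together; a shape with no WHERE clause needs the
    whole table, so any row filter for that table is dropped.
    """
    filters = {}  # table -> set of where clauses, or None for "whole table"
    for table, where in shapes:
        if where is None or filters.get(table, set()) is None:
            filters[table] = None
        else:
            filters.setdefault(table, set()).add(where)

    parts = []
    for table in sorted(filters):
        clauses = filters[table]
        if clauses is None:
            parts.append(f'"{table}"')
        else:
            combined = " OR ".join(f"({c})" for c in sorted(clauses))
            parts.append(f'"{table}" WHERE ({combined})')

    if not parts:
        # No active shapes left: caller would instead drop all tables
        # from the publication.
        return None
    return f'ALTER PUBLICATION "{publication}" SET TABLE ' + ", ".join(parts)


# Example: two filtered shapes on "items" plus one whole-table shape
# on "users" collapse into a single statement.
print(publication_set_sql("pub", [("items", "id = 1"),
                                  ("items", "id = 2"),
                                  ("users", None)]))
```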

KyleAMathews (Contributor) commented:
> can be a matter of calling this every time a shape is deleted and once upon starting (with recovered shapes or no shapes).

How much work is it to generate this call? What if 1000s of new shapes are being created every second?

msfstef (Contributor, Author) commented Oct 1, 2024

The cleanups in general can be scheduled, but it's important to note that not cleaning up also means we're processing data through the replication stream that we don't need, which also takes up resources. Delaying a few requests in order to free up resources is probably still the better option, but we'd need to benchmark that. Cleaning it up at any point is probably the first step.

We could also avoid calling it whenever a shape is deleted and only call it whenever a shape is created, like we do now; the creation call would then clean up what is not used (and perhaps trigger periodic cleanups as well, rather than cleaning on every shape deletion).

For reference, we're already doing at least two roundtrips to Postgres whenever a shape is created in order to adjust the publication filter, and that happens on the critical path, so I doubt the cleanup will be a bottleneck.

msfstef added a commit that referenced this issue Oct 1, 2024
Addresses #1749

- Makes publication and slot names configurable via a
`REPLICATION_STREAM_ID` env variable, which can ultimately be used for
multiple electric deploys
- Quotes all publication and slot names to address potential issues with
configurable names (alternative is to force downcase them when
initialised to avoid nasty case-sensitive bugs)
- Waits for a message from `Electric.LockConnection` that the lock is
acquired before initialising `ConnectionManager` with the replication
stream and shapes.
- If more than one Electric tries to connect to the same replication
slot (with the same `REPLICATION_STREAM_ID`), it will make a blocking
query to acquire the lock that will resolve once the previous Electric
using that slot releases it - this addresses rolling deploys, and
ensures resources are initialised only once the previous Electric has
released them
- Could potentially switch to `pg_try_advisory_lock`, which is not a
blocking query but immediately returns whether the lock could be
acquired, and implement retries with backoff - but since using
`pg_advisory_lock` simplifies the implementation I decided to start with
that and see what people think.
 

Things that I still need to address:
- Currently the publication gets altered when a shape is created (adds a
table and potentially a row filter) but no cleanup occurs - so the
publication can potentially grow to include everything between restarts
and deploys even if it is not being used.
- The way I want to address this is to change the
`Electric.Postgres.Configuration` to alter the publication based on
_all_ active shapes rather than based on each individual one, in that
case every call will update the publication as necessary and
resuming/cleaning can be a matter of calling this every time a shape is
deleted and once upon starting (with recovered shapes or no shapes). Can
be a separate PR.
- Created #1774 to
address this separately

---------

Co-authored-by: Oleksii Sholik <[email protected]>
balegas (Contributor) commented Oct 1, 2024

It's a bit orthogonal to this issue, but we need to re-think how we set up filtering on the replication slot.

We cannot set a replication filter per shape; we need to be more selective with the WHERE clauses. Passing filters like `WHERE id = x`, `WHERE id = y` (filtering by id) will never work as the number of shapes grows.

Thinking at a macro level, one Electric instance needs to be able to handle a certain number of rows, no matter how many filters you put into the replication slot. If we get to the point where the maximum throughput of Electric is not high enough, we need to do shape matching more efficiently.

  • I'd like to evaluate the performance gains of just filtering out columns that we don't need, which would be a very simple filter to maintain.
  • There might be other easy filters we could apply.

I'll create an issue for evaluating the performance impact of filtering.

KyleAMathews (Contributor) commented:
Columns + tables seems sustainable.

balegas (Contributor) commented Oct 1, 2024

Yeah, we already subscribe to tables, but we should also drop them from the replication slot when the last shape for that table is gone. Columns + tables will not change so frequently.

@msfstef is there a quick route for implementing "last shape for table is gone"? Otherwise, I think this can be left until we work on #1778.

msfstef (Contributor, Author) commented Oct 2, 2024

@balegas it would require a bit of rewiring but I think we could do the following:

  1. Boot up Electric and recover any persisted shapes
  2. Build and maintain a reference-counting index for tables (and in the future perhaps columns as well)
    2a. We can also rebuild this from the list of shapes every time; it shouldn't be too expensive even with 100k shapes
  3. On every shape addition/removal, update the index and add/remove tables and filters on the publication
    3a. This means we only update the publication if tables and/or columns are added or removed, so most shapes (ones selecting the same columns and operating on the same table with different filters) would not affect the publication

Should be a relatively simple solution, and we can leverage an ETS table to avoid going through a single process for this.
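The reference-counting idea in steps 2-3 can be sketched as follows (an illustrative Python sketch, simplified to tables only; the real implementation would be Elixir, e.g. backed by an ETS table as suggested). The key property is that the publication only needs touching when a table's count crosses zero in either direction:

```python
class RelationIndex:
    """Reference-count the tables used by active shapes (step 2 above).

    add_shape/remove_shape return True only when the publication must be
    updated: when a table is used for the first time, or when its last
    shape is gone (step 3a: additional shapes on an already-published
    table don't touch the publication).
    """

    def __init__(self):
        self.counts = {}  # table name -> number of shapes using it

    def add_shape(self, table):
        self.counts[table] = self.counts.get(table, 0) + 1
        return self.counts[table] == 1  # first shape on this table

    def remove_shape(self, table):
        self.counts[table] -= 1
        if self.counts[table] == 0:
            del self.counts[table]
            return True  # last shape on this table is gone
        return False
```

Usage: on boot, replay all recovered shapes through `add_shape` to rebuild the index (step 2a), then alter the publication only when a call returns `True`.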

KyleAMathews pushed a commit that referenced this issue Nov 1, 2024
balegas (Contributor) commented Nov 24, 2024

I think that adding fine-grained filtering to the replication slot remains a bit of a risk (it seems we're abusing it). We should evaluate how adding more filters to the replication slot increases lag/compute cost.

q: do we only filter the columns that shapes subscribe to?
UPDATE: found #1831

msfstef added a commit that referenced this issue Dec 17, 2024
Closes #1774

This work started out introducing column filters (see
#1831) but hit a
roadblock because we use `REPLICA IDENTITY FULL` - however, the
work also takes care of cleaning up filters.

- Introduced a single process for updating the publication - we were
locking on it before anyway, so we might as well linearise it ourselves.
- The process maintains a reference-counted structure of the filters per
relation, including where clauses and filtered columns, in order to
produce correct overall filters per relation
- Updates to the publication are debounced to allow batching many shape
creations together
- Every update does a complete rewrite of the publication filters so
they are kept clean - also introduced a `remove_shape` call so that if
Electric remains with no shapes it also has no subscriptions to tables.


## TODOs
- [x] Write tests for `PublicationManager`
- [x] Write procedure for recovering in-memory state from
`shape_status.list_shapes` in `recover_shapes`
- [ ] Split where clauses at top-level `AND`s to improve filter
optimality (suggested by @icehaunter) [edit: not doing this now, as
we can be smart about this and do even more "merging" of where clauses
like `x = 1` and `x = 2` into `x IN (1, 2)` - separate PR]
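The where-clause "merging" mentioned in the last TODO could look roughly like this (an illustrative Python sketch over clauses as plain strings; the real clauses are parsed SQL, so a production version would work on the AST, not regexes):

```python
import re

def merge_equality_clauses(clauses):
    """Merge simple `col = literal` clauses on the same column into one
    `col IN (...)` clause; anything that isn't a plain equality is kept
    verbatim (the caller ORs the resulting list together)."""
    eq = re.compile(r"^\s*(\w+)\s*=\s*(\S+)\s*$")
    by_col, rest = {}, []
    for clause in clauses:
        m = eq.match(clause)
        if m:
            by_col.setdefault(m.group(1), []).append(m.group(2))
        else:
            rest.append(clause)  # e.g. range predicates stay as-is
    merged = [
        f"{col} = {vals[0]}" if len(vals) == 1
        else f"{col} IN ({', '.join(vals)})"
        for col, vals in by_col.items()
    ]
    return merged + rest

# Two equalities on x collapse; y and the range predicate are untouched.
print(merge_equality_clauses(["x = 1", "x = 2", "y = 3", "z > 5"]))
```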