Filter replication publication columns #1831

msfstef · 2024-10-10T14:31:35Z

Following #1804 and corresponding PR #1829

We want to alter the publication to only send over the selected/required columns over the replication stream, which would be the selected columns via the column parameter, and potentially also any columns referenced in the where clause if specified.

While PG does support partial replication of a table (at least for PG15 and above I think), there are some warnings in the docs:

it is [...] possible to get into [a problematic] situation by adding or altering column lists on the publication side after a subscription has been created.
This means changing the column lists of tables on publications that are already subscribed could lead to errors being thrown on the subscriber side.

This implies that even with a single publication and subscription, altering the column lists of tables might lead to errors - this needs to be investigated.

The text was updated successfully, but these errors were encountered:

balegas · 2024-11-27T11:21:10Z

After discussing, we think the warning does not affect us, since we're impersonating the protocl and we only have one subscriber for one publication anyways.

msfstef · 2024-12-05T17:00:06Z

Short update on issue that I ran into with this:

It seems that we cannot use REPLICA IDENTITY as FULL, which always sends the old column values of updated rows, if we have column filtering on the publication (see docs)

The error that I get if I do is:

** (Postgrex.Error) ERROR 42P10 (invalid_column_reference) cannot update table "wide_table"
Column list used by the publication does not cover the replica identity.

balegas · 2024-12-16T11:20:58Z

We don't need to set REPLICA IDENTITY to FULL all the times, so we should avoid setting it for the sake of filtering columns in the publication.
We need FULL:

Table doesn't have a primary key defined
the developer requested a shape with param replica=full

In case of a developer requesting a shape with replica=full and that shape is deleted, we should ideally reset the REPLICA IDENTITY back to the default. Note that this is a sensitive operation because it might causes deadlocks.

msfstef · 2024-12-17T09:50:53Z

One more issue discovered is that a where clause on the publication can only reference columns covered by the replica identity, which means that if the where clause references anything other than a primary key we still need replica identity full:

A row filter expression (i.e., the WHERE clause) must contain only columns that are covered by the REPLICA IDENTITY, in order for UPDATE and DELETE operations to be published. For publication of INSERT operations, any column may be used in the WHERE expression.

See docs for more info

We can still configure it on a need-only-basis, but wanted to flag up that there's more cases where it is required.

@icehaunter

Closes #1774 This work started to introduce column filters (see #1831) but ended up on a road block because of us using `REPLICA IDENTITY FULL` - however the work also takes care of cleaning up filters. - Introduced singular process for updating publication - we were locking on it before anyway, might as well linearise it ourselves. - Process maintains reference counted structure for the filters per relation, including where clauses and filtered columns, in order to produce correct overall filters per relation - Update to the publication is debounced to allow batching together many shape creations - Every update does a complete rewrite of the publication filters so they are maintained clean - but also introduced a `remove_shape` call so that if electric remains with no shapes it should also have no subscriptions to tables. ## TODOs - [x] Write tests for `PublicationManager` - [x] Write procedure for recovering in-memory state from `shape_status.list_shapes` in `recover_shapes` - [ ] Split where clauses at top-level `AND`s to improve filter optimality (suggested be @icehaunter ) - [edit: not doing this now, as we can be smart about this an do even more "merging" of where clauses like `x = 1` and `x = 2` to `x in (1, 2)` - separate PR]

balegas · 2024-12-17T12:37:48Z

We have discussed removing publication where clause filters, as we've seen that it is actually slower than doing the filtering in Elixir (...maybe not in all cases, we'd have to check), but if we do this than we'd be good with keeping replica identity default

Can you describe the other issue you mentioned in our call?

msfstef · 2024-12-17T12:41:34Z

The other issue with filtering columns, which requires setting REPLICA IDENTITY DEFAULT, is the following:

Suppose you have a table foo (id, value1, value2, value3)

Create a Shape A that selects columns id,value1
Create a Shape B that selects columns id,value2
At this point the publication would only replicate id,value1,value2, so REPLICA IDENTITY is set to DEFAULT and thus update statements only carry the old values of the primary key cols (i.e. old id values)
An update is made to a row with id=1 such that only value2 is affected
The update comes through, and we only get the full new value of the row with id=1, but can't determine that only value2 was touched by this update since we do not get the old values other than the primary key
We have to send this update operation to both Shape A and Shape B, even though this update should only have gone to Shape B since Shape A does not care about value2.

Therefore by using REPLICA IDENTITY DEFAULT we lose the capacity to filter out changes to shapes with different selected column subsets because we cannot determine whether update operations are relevant or not.

balegas · 2024-12-17T12:52:02Z

The cost would be on the number of shapes defined for a table with different column filters. Do we think there could be many? On the other end, how much would we save in the general case by filtering out columns in the publication?

We're entering the ground of fine-grained optimization. It's not clear what is best for the different workloads. I'm keen in having the possibility of filtering columns on the publication, but I wonder how and when is the best way to expose this to developers, if it isn't one-size-fits all.

msfstef added this to the Production Readiness milestone Oct 10, 2024

balegas mentioned this issue Nov 24, 2024

Clean up publication filters when shapes are removed #1774

Closed

balegas assigned msfstef Nov 27, 2024

msfstef mentioned this issue Dec 11, 2024

feat(sync-service): Clean up publication filters #2154

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filter replication publication columns #1831

Filter replication publication columns #1831

msfstef commented Oct 10, 2024

balegas commented Nov 27, 2024

msfstef commented Dec 5, 2024

balegas commented Dec 16, 2024

msfstef commented Dec 17, 2024

balegas commented Dec 17, 2024 •

edited

Loading

msfstef commented Dec 17, 2024

balegas commented Dec 17, 2024

Filter replication publication columns #1831

Filter replication publication columns #1831

Comments

msfstef commented Oct 10, 2024

balegas commented Nov 27, 2024

msfstef commented Dec 5, 2024

balegas commented Dec 16, 2024

msfstef commented Dec 17, 2024

balegas commented Dec 17, 2024 • edited Loading

msfstef commented Dec 17, 2024

balegas commented Dec 17, 2024

balegas commented Dec 17, 2024 •

edited

Loading