Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter replication publication columns #1831

Open
msfstef opened this issue Oct 10, 2024 · 7 comments
Open

Filter replication publication columns #1831

msfstef opened this issue Oct 10, 2024 · 7 comments
Assignees

Comments

@msfstef
Copy link
Contributor

msfstef commented Oct 10, 2024

Following #1804 and corresponding PR #1829

We want to alter the publication to only send over the selected/required columns over the replication stream, which would be the selected columns via the column parameter, and potentially also any columns referenced in the where clause if specified.

While PG does support partial replication of a table (at least for PG15 and above I think), there are some warnings in the docs:

it is [...] possible to get into [a problematic] situation by adding or altering column lists on the publication side after a subscription has been created.
This means changing the column lists of tables on publications that are already subscribed could lead to errors being thrown on the subscriber side.

This implies that even with a single publication and subscription, altering the column lists of tables might lead to errors - this needs to be investigated.

@balegas
Copy link
Contributor

balegas commented Nov 27, 2024

After discussing, we think the warning does not affect us, since we're impersonating the protocl and we only have one subscriber for one publication anyways.

@msfstef
Copy link
Contributor Author

msfstef commented Dec 5, 2024

Short update on issue that I ran into with this:

It seems that we cannot use REPLICA IDENTITY as FULL, which always sends the old column values of updated rows, if we have column filtering on the publication (see docs)

The error that I get if I do is:

** (Postgrex.Error) ERROR 42P10 (invalid_column_reference) cannot update table "wide_table"
Column list used by the publication does not cover the replica identity.

@balegas
Copy link
Contributor

balegas commented Dec 16, 2024

We don't need to set REPLICA IDENTITY to FULL all the times, so we should avoid setting it for the sake of filtering columns in the publication.
We need FULL:

  • Table doesn't have a primary key defined
  • the developer requested a shape with param replica=full

In case of a developer requesting a shape with replica=full and that shape is deleted, we should ideally reset the REPLICA IDENTITY back to the default. Note that this is a sensitive operation because it might causes deadlocks.

@msfstef
Copy link
Contributor Author

msfstef commented Dec 17, 2024

One more issue discovered is that a where clause on the publication can only reference columns covered by the replica identity, which means that if the where clause references anything other than a primary key we still need replica identity full:

A row filter expression (i.e., the WHERE clause) must contain only columns that are covered by the REPLICA IDENTITY, in order for UPDATE and DELETE operations to be published. For publication of INSERT operations, any column may be used in the WHERE expression.

See docs for more info

We can still configure it on a need-only-basis, but wanted to flag up that there's more cases where it is required.

msfstef added a commit that referenced this issue Dec 17, 2024
Closes #1774

This work started to introduce column filters (see
#1831) but ended up on a
road block because of us using `REPLICA IDENTITY FULL` - however the
work also takes care of cleaning up filters.

- Introduced singular process for updating publication - we were locking
on it before anyway, might as well linearise it ourselves.
- Process maintains reference counted structure for the filters per
relation, including where clauses and filtered columns, in order to
produce correct overall filters per relation
- Update to the publication is debounced to allow batching together many
shape creations
- Every update does a complete rewrite of the publication filters so
they are maintained clean - but also introduced a `remove_shape` call so
that if electric remains with no shapes it should also have no
subscriptions to tables.


## TODOs
- [x] Write tests for `PublicationManager`
- [x] Write procedure for recovering in-memory state from
`shape_status.list_shapes` in `recover_shapes`
- [ ] Split where clauses at top-level `AND`s to improve filter
optimality (suggested be @icehaunter ) - [edit: not doing this now, as
we can be smart about this an do even more "merging" of where clauses
like `x = 1` and `x = 2` to `x in (1, 2)` - separate PR]
@balegas
Copy link
Contributor

balegas commented Dec 17, 2024

We have discussed removing publication where clause filters, as we've seen that it is actually slower than doing the filtering in Elixir (...maybe not in all cases, we'd have to check), but if we do this than we'd be good with keeping replica identity default

Can you describe the other issue you mentioned in our call?

@msfstef
Copy link
Contributor Author

msfstef commented Dec 17, 2024

The other issue with filtering columns, which requires setting REPLICA IDENTITY DEFAULT, is the following:

Suppose you have a table foo (id, value1, value2, value3)

  1. Create a Shape A that selects columns id,value1
  2. Create a Shape B that selects columns id,value2
  3. At this point the publication would only replicate id,value1,value2, so REPLICA IDENTITY is set to DEFAULT and thus update statements only carry the old values of the primary key cols (i.e. old id values)
  4. An update is made to a row with id=1 such that only value2 is affected
  5. The update comes through, and we only get the full new value of the row with id=1, but can't determine that only value2 was touched by this update since we do not get the old values other than the primary key
  6. We have to send this update operation to both Shape A and Shape B, even though this update should only have gone to Shape B since Shape A does not care about value2.

Therefore by using REPLICA IDENTITY DEFAULT we lose the capacity to filter out changes to shapes with different selected column subsets because we cannot determine whether update operations are relevant or not.

@balegas
Copy link
Contributor

balegas commented Dec 17, 2024

The cost would be on the number of shapes defined for a table with different column filters. Do we think there could be many? On the other end, how much would we save in the general case by filtering out columns in the publication?

We're entering the ground of fine-grained optimization. It's not clear what is best for the different workloads. I'm keen in having the possibility of filtering columns on the publication, but I wonder how and when is the best way to expose this to developers, if it isn't one-size-fits all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants