
feat: Add columns API query parameter to filter table columns #1829

Merged — 19 commits merged into main from msfstef/column-filtering on Oct 14, 2024

Conversation

@msfstef (Contributor) commented Oct 10, 2024

Implements #1804

Overview

  • I've picked columns as the query parameter name: select might confuse people and invite comparisons to SQL SELECTs, and filter is too close to what the where parameter does.
  • The columns are comma-separated column names, parsed at the validation stage.
  • If the parameter is specified, it has to include the primary key columns.
    • We could make it such that the PK is always included, but that feels like it might be confusing - explicit is better?
    • For tables without a primary key, we treat all columns as primary keys - we might want to revisit that w.r.t. this feature.
  • Columns are also validated by ensuring that all specified column names match the ones in the schema.
  • The selected columns are stored in the Shape definition as an array of column names.
    • We could have filtered the table_info data based on the selected columns instead of storing them separately, but that might cause issues with relation change handling, and generally I prefer the shape definition to contain all information related to it (i.e. the schema at the time of creation plus whatever column filters were applied, rather than compacting the two).
  • I modified Shape-related APIs for converting changes etc. to also apply the column filtering - this is all we need to ensure it works correctly for replication stream log entries.
  • In the Snapshotter, I modified the get_column_names method to also apply the filtering if present, which takes care of the snapshot log entries.
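The validation and filtering rules above can be sketched roughly as follows. This is TypeScript for illustration only (the actual sync-service code is Elixir); the function and type names are hypothetical, and quoting/case-folding of identifiers is ignored here.

```typescript
// Simplified sketch of the rules described above: every requested column
// must exist in the schema, and the primary key columns must be included.
type Schema = { columns: string[]; primaryKey: string[] }
type Result = { ok: true } | { ok: false; error: string }

function validateColumns(schema: Schema, requested: string[]): Result {
  const unknown = requested.filter((c) => !schema.columns.includes(c))
  if (unknown.length > 0) {
    return { ok: false, error: `Invalid columns: ${unknown.join(", ")}` }
  }
  const missingPk = schema.primaryKey.filter((c) => !requested.includes(c))
  if (missingPk.length > 0) {
    return { ok: false, error: `Must include primary key columns: ${missingPk.join(", ")}` }
  }
  return { ok: true }
}

// Filtering a log-entry row down to the selected columns:
function filterRow(
  row: Record<string, unknown>,
  columns: string[]
): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(row).filter(([key]) => columns.includes(key))
  )
}
```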

Other changes

  • I've changed Shapes.new to return errors along with the field they are associated with (:root_table, :where, or :columns) in order to return correct validation errors: all of the stronger validation occurs at cast_root_table when the shape is created and PG is used, but it really validates more than just root_table.
  • I've updated the client to accept a columns argument which is a list of strings
  • I've updated the OpenAPI spec to include the columns parameter
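On the wire, the columns option becomes a comma-separated value on the shape URL. A minimal sketch of that translation (the endpoint path and the other parameter names here are assumptions for illustration; the PR only specifies the columns parameter itself):

```typescript
// Hypothetical helper showing how a columns list maps onto the query string.
function shapeUrl(baseUrl: string, table: string, columns?: string[]): string {
  const params = new URLSearchParams({ table })
  if (columns !== undefined) {
    // Must include the primary key column(s), per the validation rules above.
    params.set("columns", columns.join(","))
  }
  return `${baseUrl}/v1/shape?${params.toString()}`
}
```

Note that URLSearchParams percent-encodes the commas, which the server decodes before the columns parameter is parsed.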

Things to consider

  • How do we want to handle column names that need quoting (special characters)? In my opinion the client needs to type them exactly as they are in Postgres; otherwise they get a 400 validation error back telling them which columns are invalid, so it should be fairly easy to fix.
  • Replication publication filtering of columns is available, but updating it might cause errors according to the PG docs (https://www.postgresql.org/docs/15/logical-replication-col-lists.html#LOGICAL-REPLICATION-COL-LIST-COMBINING), so I'm not sure if that's something we want to consider.
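The quoted-name question above comes down to splitting the columns parameter on commas while respecting PG-style double-quoted identifiers, where a doubled quote ("") inside quotes is an escaped quote. A rough illustrative sketch (TypeScript; the real parsing lives in Elixir, and this version does not track quoted vs. unquoted for case folding):

```typescript
// Split a comma-separated column list, treating "..." as a quoted
// identifier whose commas are literal and whose "" means an escaped ".
function parseColumnsParam(input: string): string[] {
  const out: string[] = []
  let current = ""
  let inQuotes = false
  for (let i = 0; i < input.length; i++) {
    const ch = input[i]
    if (ch === '"') {
      if (inQuotes && input[i + 1] === '"') {
        current += '"' // escaped quote inside a quoted identifier
        i++
      } else {
        inQuotes = !inQuotes // entering or leaving a quoted identifier
      }
    } else if (ch === "," && !inQuotes) {
      out.push(current)
      current = ""
    } else {
      current += ch
    }
  }
  out.push(current)
  return out
}
```

For example, `"fo""o",bar` splits into the identifier fo"o and the identifier bar, rather than breaking at the comma inside the quotes.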

@msfstef msfstef linked an issue Oct 10, 2024 that may be closed by this pull request

netlify bot commented Oct 10, 2024

Deploy Preview for electric-next ready!

🔨 Latest commit: 7b5aa03
🔍 Latest deploy log: https://app.netlify.com/sites/electric-next/deploys/670d22b128596700074e7c28
😎 Deploy Preview: https://deploy-preview-1829--electric-next.netlify.app

@robacourt (Contributor) left a comment
Neat!

@balegas (Contributor) commented Oct 10, 2024

I agree with your choices. The only thing is that we could avoid the errors for PKs, but I prefer erroring.

> How do we want to handle column names with special names (quoted)? In my opinion the client needs to type them exactly as they are on postgres, otherwise they get a 400 validation error back telling them which columns are invalid, so it should be fairly easy to fix.

Please make sure it is the same logic as table names - check with @kevin-dp.

> Replication publication filtering of columns is available

I think this is an important thing to have for the reasons we've discussed. Since we have a single subscriber and we know all the columns Electric needs, maybe PG behaves. Please open a separate ticket so it doesn't block merging this one, but I think this is one thing we want to validate is working.

@kevin-dp (Contributor) left a comment

Some issues here:

  • PG allows any character in column names if you quote them, e.g. CREATE TABLE foo ("a,column""with-weird name" INT);. Currently the parser will misinterpret the comma as a separator, so we should allow column names to be quoted (consistent with table name parsing).
  • The parser is case-sensitive by default. This is inconsistent with schema and table names, which are case-insensitive unless you quote them.

This would be a good moment to implement #1814 such that we have a parse function that we can use for parsing schema identifiers, table identifiers, and column identifiers.
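The case-folding rule described here follows Postgres: unquoted identifiers fold to lower case, quoted ones are taken verbatim with "" unescaped to ". A hedged sketch of such a parse function (TypeScript for illustration; the actual implementation would live in the Elixir identifiers module, and the name parseIdentifier is hypothetical):

```typescript
type ParseResult = { ok: true; name: string } | { ok: false; error: string }

// Normalize a single identifier the way PG does: quoted identifiers are
// verbatim (with "" unescaped), unquoted identifiers are lowercased and
// restricted to letters, digits, underscore, and dollar sign.
function parseIdentifier(raw: string): ParseResult {
  if (raw.length >= 2 && raw.startsWith('"') && raw.endsWith('"')) {
    const inner = raw.slice(1, -1)
    // After removing escaped pairs, no bare quote may remain inside.
    if (/"/.test(inner.replace(/""/g, ""))) {
      return { ok: false, error: `Invalid quoted identifier: ${raw}` }
    }
    return { ok: true, name: inner.replace(/""/g, '"') }
  }
  if (!/^[a-zA-Z_][a-zA-Z0-9_$]*$/.test(raw)) {
    return { ok: false, error: `Invalid unquoted identifier contains special characters: ${raw}` }
  }
  return { ok: true, name: raw.toLowerCase() }
}
```

So Foo and "foo" would refer to the same column, while "Foo" stays case-sensitive - matching how schema and table names are already handled.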

@kevin-dp kevin-dp self-requested a review October 14, 2024 09:44
@kevin-dp (Contributor) left a comment

Well done!
I left some comments mostly asking for minor changes.

packages/sync-service/lib/electric/shapes/shape.ex (outdated, resolved)
packages/sync-service/lib/electric/shapes/shape.ex (outdated, resolved)
packages/typescript-client/src/client.ts (outdated, resolved)
website/electric-api.yaml (outdated, resolved)
@kevin-dp (Contributor) left a comment

Brilliant!
Left a few very minor comments.
Ready to merge for my part 💯

@@ -0,0 +1,50 @@
defmodule Electric.Plug.Utils do
Contributor:

Let's perhaps not introduce several utility files but keep everything in the outer utils.ex file?

@msfstef (Author):

I feel like this is not a general utility like the other ones and is more meant to be part of the Plug code - we already have a very long "serve shape" plug module which defines submodules; the idea is that parsing logic could move to this.

packages/sync-service/lib/electric/plug/utils.ex (outdated, resolved)
packages/sync-service/lib/electric/plug/utils.ex (outdated, resolved)
iex> Electric.Plug.Utils.parse_columns_param(~S|"fo""o",bar|)
{:ok, ["bar", ~S|fo"o|]}
iex> Electric.Plug.Utils.parse_columns_param(~S|"id,"name"|)
{:error, ~S|Invalid unquoted identifier contains special characters: "id|}
Contributor:
why is this saying "unquoted identifier"? The sigil does contain quotes.
The problem should be that the quote inside the quote is not escaped (with a double quote).

@msfstef (Author) commented Oct 14, 2024

The provided column was "id - so I'm treating this as an unquoted identifier that includes a special character (").

If the strategy for splitting on commas were different, it would instead produce id,"name as a quoted identifier and give a different error.

packages/sync-service/lib/electric/postgres/identifiers.ex (outdated, resolved)
packages/sync-service/lib/electric/postgres/identifiers.ex (outdated, resolved)
@msfstef msfstef force-pushed the msfstef/column-filtering branch from 20a81c2 to 7b5aa03 Compare October 14, 2024 13:54
@msfstef msfstef force-pushed the msfstef/column-filtering branch from 5929b95 to adad2d1 Compare October 14, 2024 14:21
@msfstef msfstef merged commit 25c437f into main Oct 14, 2024
24 checks passed
@msfstef msfstef deleted the msfstef/column-filtering branch October 14, 2024 14:47
KyleAMathews pushed a commit that referenced this pull request Nov 1, 2024
Successfully merging this pull request may close these issues.

API parameter to filter columns shape