Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement PG identifier parsing algorithm #1814

Closed
kevin-dp opened this issue Oct 9, 2024 · 4 comments · Fixed by #1873
Closed

Implement PG identifier parsing algorithm #1814

kevin-dp opened this issue Oct 9, 2024 · 4 comments · Fixed by #1873
Assignees

Comments

@kevin-dp
Copy link
Contributor

kevin-dp commented Oct 9, 2024

As proposed by @alco in #1764 (comment):

I would argue against going to Postgres for name lookup. Replicating Postgres' identifier parsing is much simpler because it can be implemented as a pure function. Postgres is open source and it's not like its definition of what is an identifier is going to change any time soon: no DB queries or caching required.

I have to admit that Postgres' definition of the identifier token (as specified in its lexer code) is a bit cryptic: it explicitly mentions the ranges A-Z, a-z and byte values in the range \200-\377 (those are octal numbers, in decimal they are 128 to 255), but I think the intent there is to allow any valid UTF-8 code point beyond the 7-bit ASCII Plane, so not only letters. For example, this works in psql:

=# create table 🪷(id text);
CREATE TABLE
=# \d
        List of relations
 Schema │ Name │ Type  │  Owner   
────────┼──────┼───────┼──────────
 public │ foo¡ │ table │ postgres
 public │ Δ    │ table │ postgres
 public │ 🪷   │ table │ postgres

There is, apparently, a separate validation to determine that the given byte sequence is valid UTF-8 encoding (or any other encoding that the database server is configured to use) because not all sequences of bytes are valid identifiers.

Regardless, the algorithm for lower-casing identifiers is very small and straightforward. Here's its full implementation -
https://github.com/postgres/postgres/blob/259a0a99fe3d45dcf624788c1724d9989f3382dc/src/backend/parser/scansup.c#L55-L73

So it is conceivable that we can implement our own validation of user-provided table identifiers by checking they are valid UTF-8 strings and don't contain certain ASCII characters. Then we can normalize the identifier by implement Postgres' downcasing algorithm to obtain a string that we can use to fetch table info from Postgres and as a cache key for storing table info in memory.

@msfstef
Copy link
Contributor

msfstef commented Oct 14, 2024

@alco @kevin-dp the postgres algorithm depends on the server encoding as well - should we grab the encoding in the metadata acquisition stage, or just default to UTF8 for start?

@msfstef
Copy link
Contributor

msfstef commented Oct 14, 2024

I've implemented a basic Electric.Postgres.Identifier.parse in #1829 to serve the immediate needs of that feature.

Postgres makes use of it's server encoding and max identifier character length configuration, so we'd need to retrieve those as well as part of our "prep" and pass them along to the parsing routine. I will address that in a separate PR, probably overlaps with @alco 's work on putting all prep queries in the replication connection.

@alco
Copy link
Member

alco commented Oct 14, 2024

@alco @kevin-dp the postgres algorithm depends on the server encoding as well - should we grab the encoding in the metadata acquisition stage, or just default to UTF8 for start?

In the spirit of "any complex system that works has evolved from a simpler system that worked" let's start with assuming UTF8 first.

@msfstef
Copy link
Contributor

msfstef commented Oct 21, 2024

Addressing this in #1873 - if we don't want to merge it then I suggest we close this issue

KyleAMathews pushed a commit that referenced this issue Nov 1, 2024
Fixes #1814

We still go to PG to validate the table as we need its OID, so this is
not strictly necessary, but since we have validation logic we might as
well validate things locally to avoid going to PG for clearly invalid
table identifiers?

This also fixes the accepted characters by PG for identifiers [based on
the
docs](https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS)
as it was excluding some unicode chars and dollar signs before.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants