Implement PG identifier parsing algorithm #1814

kevin-dp · 2024-10-09T06:50:36Z

As proposed by @alco in #1764 (comment):

I would argue against going to Postgres for name lookup. Replicating Postgres' identifier parsing is much simpler because it can be implemented as a pure function. Postgres is open source and it's not like its definition of what is an identifier is going to change any time soon: no DB queries or caching required.

I have to admit that Postgres' definition of the identifier token (as specified in its lexer code) is a bit cryptic: it explicitly mentions the ranges A-Z, a-z and byte values in the range \200-\377 (those are octal numbers, in decimal they are 128 to 255), but I think the intent there is to allow any valid UTF-8 code point beyond the 7-bit ASCII Plane, so not only letters. For example, this works in psql:
=# create table 🪷(id text);
CREATE TABLE
=# \d
        List of relations
 Schema │ Name │ Type  │  Owner   
────────┼──────┼───────┼──────────
 public │ foo¡ │ table │ postgres
 public │ Δ    │ table │ postgres
 public │ 🪷   │ table │ postgres
There is, apparently, a separate validation to determine that the given byte sequence is valid UTF-8 encoding (or any other encoding that the database server is configured to use) because not all sequences of bytes are valid identifiers.

Regardless, the algorithm for lower-casing identifiers is very small and straightforward. Here's its full implementation -
https://github.com/postgres/postgres/blob/259a0a99fe3d45dcf624788c1724d9989f3382dc/src/backend/parser/scansup.c#L55-L73

So it is conceivable that we can implement our own validation of user-provided table identifiers by checking they are valid UTF-8 strings and don't contain certain ASCII characters. Then we can normalize the identifier by implement Postgres' downcasing algorithm to obtain a string that we can use to fetch table info from Postgres and as a cache key for storing table info in memory.

The text was updated successfully, but these errors were encountered:

msfstef · 2024-10-14T11:33:23Z

@alco @kevin-dp the postgres algorithm depends on the server encoding as well - should we grab the encoding in the metadata acquisition stage, or just default to UTF8 for start?

msfstef · 2024-10-14T12:58:04Z

I've implemented a basic Electric.Postgres.Identifier.parse in #1829 to serve the immediate needs of that feature.

Postgres makes use of it's server encoding and max identifier character length configuration, so we'd need to retrieve those as well as part of our "prep" and pass them along to the parsing routine. I will address that in a separate PR, probably overlaps with @alco 's work on putting all prep queries in the replication connection.

alco · 2024-10-14T12:59:24Z

@alco @kevin-dp the postgres algorithm depends on the server encoding as well - should we grab the encoding in the metadata acquisition stage, or just default to UTF8 for start?

In the spirit of "any complex system that works has evolved from a simpler system that worked" let's start with assuming UTF8 first.

msfstef · 2024-10-21T11:24:59Z

Addressing this in #1873 - if we don't want to merge it then I suggest we close this issue

Fixes #1814 We still go to PG to validate the table as we need its OID, so this is not strictly necessary, but since we have validation logic we might as well validate things locally to avoid going to PG for clearly invalid table identifiers? This also fixes the accepted characters by PG for identifiers [based on the docs](https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS) as it was excluding some unicode chars and dollar signs before.

kevin-dp mentioned this issue Oct 14, 2024

feat: Add columns API query parameter to filter table columns #1829

Merged

msfstef self-assigned this Oct 14, 2024

msfstef mentioned this issue Oct 21, 2024

chore: Validate table names locally first #1873

Merged

msfstef closed this as completed in a8b36ac Oct 22, 2024

msfstef closed this as completed in #1873 Oct 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement PG identifier parsing algorithm #1814

Implement PG identifier parsing algorithm #1814

kevin-dp commented Oct 9, 2024

msfstef commented Oct 14, 2024

msfstef commented Oct 14, 2024

alco commented Oct 14, 2024

msfstef commented Oct 21, 2024

Implement PG identifier parsing algorithm #1814

Implement PG identifier parsing algorithm #1814

Comments

kevin-dp commented Oct 9, 2024

msfstef commented Oct 14, 2024

msfstef commented Oct 14, 2024

alco commented Oct 14, 2024

msfstef commented Oct 21, 2024