Consider introducing unique expression IDs in Logical/Physical plan #8379
Comments
I wonder what "unique" means? Like every newly created Some examples:
// would expr1 and expr2 have the same id?
let expr1 = col("foo");
let expr2 = col("foo");
// would expr1 and expr2 have the same id?
let expr1 = col("foo");
let expr2 = expr1.clone();
|
Actually I think I was off the mark on what Like given a logical plan:
That top level projection has Whereas with ExprIds, it could be possible for And I think each new expr would have a new ID.
Honestly, I could be way off the mark here on the usages/benefits of ExprId 😅 It's just something I was thinking about, especially given how verbose it can be to check whether two columns are the same when taking into account the table, schema and catalog parts of a column's identifier.
So instead of having to find the original column of a projected column in a logical plan by name during logical optimization/physical planning, that lookup could be done once in an analyzer rule pass, with ExprIds used afterwards. |
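The "resolve once, then use ids" idea from the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ColumnId`, `SourceColumn` and `Resolver` are hypothetical names, not existing DataFusion types, and the id is simply the column's position in a list. The catalog and schema parts of the identifier are left out for brevity.

```rust
/// Hypothetical numeric column id (assumption: the index into the resolver's list).
pub type ColumnId = usize;

/// Hypothetical description of a column provided by some table in the plan.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SourceColumn {
    pub table: String,
    pub name: String,
}

/// Hypothetical analyzer helper: resolves name-based references to ids once,
/// so that later passes compare ids instead of strings.
pub struct Resolver {
    columns: Vec<SourceColumn>,
}

impl Resolver {
    pub fn new(columns: Vec<SourceColumn>) -> Self {
        Self { columns }
    }

    /// Resolve an optionally qualified column name to its id.
    /// Fails if no column matches, or if more than one matches.
    pub fn resolve(&self, table: Option<&str>, name: &str) -> Result<ColumnId, String> {
        let mut matches = self
            .columns
            .iter()
            .enumerate()
            .filter(|(_, c)| c.name == name && table.map_or(true, |t| c.table == t))
            .map(|(id, _)| id);

        match (matches.next(), matches.next()) {
            (Some(id), None) => Ok(id),
            (None, _) => Err(format!("column {name} not found")),
            (Some(_), Some(_)) => Err(format!("column {name} is ambiguous")),
        }
    }
}
```

With two tables `t1` and `t2` that both expose `foo`, `resolve(Some("t1"), "foo")` and `resolve(Some("t2"), "foo")` yield different ids, while an unqualified `foo` is reported as ambiguous at analysis time rather than surfacing later in an optimizer rule.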
What you describe seems similar to the I believe Postgres has a similar way of addressing fields (it does so by index rather than column name) in its logical exprs. The downside I remember from working with Postgres was that there were many potential hard-to-track-down issues when the indexes got messed up. It might be interesting to see what this would look like in DataFusion 🤔 |
Related to this, SQLite and Postgres allow duplicate column names in results. From SQLite tests:
SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1
DataFusion errors out with
Maybe after/as part of implementing this, DataFusion should relax that restriction? |
One challenge is that most of the arrow-rs functionality in RecordBatch, for example, assumes unique column names ( |
I assume you mean Ecosystem survey:
|
This might be a duplicate of #6543? |
@tv42 I agree it is certainly related |
This would be a great item in the broader scope of #12723, which intends to make DataFusion logical plans state of the art. |
Is your feature request related to a problem or challenge?
In Spark, there is a concept of ExprId, which is used to uniquely identify named expressions: https://github.com/apache/spark/blob/9bb358b51e30b5041c0cd20e27cf995aca5ed4c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L41-L57
Is it worth attempting to introduce something similar in DataFusion?
There are issues caused by optimizer rules comparing directly on column names, which leads to bugs when duplicate names appear, such as #8374.
If during the analysis of a plan we can assign unique numeric IDs for columns, we could check for column equality based on these IDs and not need to compare string names.
The obvious downside is that this looks like a large refactoring effort, not to mention the breaking changes.
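To make the proposal a little more concrete, here is a minimal sketch of id-based column equality. It is not an existing DataFusion API: `ExprId`, `ExprIdGenerator` and `ResolvedColumn` are hypothetical names chosen for illustration. It also assumes that each newly created column gets a fresh id and that a clone keeps the id of its original.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical opaque identifier for a named expression.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ExprId(u64);

/// Hypothetical generator that hands out ids that are unique per generator.
pub struct ExprIdGenerator {
    next: AtomicU64,
}

impl ExprIdGenerator {
    pub const fn new() -> Self {
        Self { next: AtomicU64::new(0) }
    }

    /// Return a fresh id; every call yields a different value.
    pub fn next_id(&self) -> ExprId {
        ExprId(self.next.fetch_add(1, Ordering::Relaxed))
    }
}

/// Hypothetical column reference carrying a display name and an id.
#[derive(Debug, Clone)]
pub struct ResolvedColumn {
    pub name: String,
    pub id: ExprId,
}

impl ResolvedColumn {
    /// Assumption: every newly created column receives a new id.
    pub fn new(name: &str, ids: &ExprIdGenerator) -> Self {
        Self { name: name.to_string(), id: ids.next_id() }
    }

    /// Compare by id only; the name plays no part in equality.
    pub fn same_column(&self, other: &ResolvedColumn) -> bool {
        self.id == other.id
    }
}
```

Under these assumptions, two columns that are both named `col1` but were created separately are not the same column, while a clone of a column still refers to the original. Whether clones should share an id is a design choice that the discussion leaves open.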
Describe the solution you'd like
Consider introducing unique IDs for columns/expressions to potentially simplify optimization/planning code.
Describe alternatives you've considered
Don't do this (large refactoring effort? breaking changes?)
Additional context
Just a thought bouncing around in my head. I would appreciate hearing more thoughts on this (even if it seems infeasible), or pointers to any prior discussion on a similar topic.