Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider introducing unique expression IDs in Logical/Physical plan #8379

Open
Jefffrey opened this issue Nov 30, 2023 · 9 comments
Open

Consider introducing unique expression IDs in Logical/Physical plan #8379

Jefffrey opened this issue Nov 30, 2023 · 9 comments
Labels
enhancement New feature or request

Comments

@Jefffrey
Copy link
Contributor

Is your feature request related to a problem or challenge?

In Spark, they have a concept of ExprId which is used to uniquely identify named expressions:

https://github.com/apache/spark/blob/9bb358b51e30b5041c0cd20e27cf995aca5ed4c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L41-L57

/**
 * A globally unique id for a given named expression.
 * Used to identify which attribute output by a relation is being
 * referenced in a subsequent computation.
 *
 * The `id` field is unique within a given JVM, while the `uuid` is used to uniquely identify JVMs.
 */
case class ExprId(id: Long, jvmId: UUID) {

  override def equals(other: Any): Boolean = other match {
    case ExprId(id, jvmId) => this.id == id && this.jvmId == jvmId
    case _ => false
  }

  override def hashCode(): Int = id.hashCode()

}

Is it worth as attempting to introduce something similar in DataFusion?

There are issues being caused by rules in the optimizer comparing directly on column name leading to bugs when duplicate names appear, such as #8374

If during the analysis of a plan we can assign unique numeric IDs for columns, we could check for column equality based on these IDs and not need to compare string names.

The obvious downside would be this seems like a large effort in refactoring, not to mention breaking changes.

Describe the solution you'd like

Consider introduction of unique ID for columns/expressions to potentially simplify optimization/planning code

Describe alternatives you've considered

Don't do this (large refactoring effort? breaking changes?)

Additional context

Just a thought I had bouncing in my head, would appreciate to hear more thoughts on this (even if this seems unfeasible), or if there was already some prior discussion on a similar topic

@Jefffrey Jefffrey added the enhancement New feature or request label Nov 30, 2023
@alamb
Copy link
Contributor

alamb commented Dec 1, 2023

I wonder what "unique" means ? Like every newly created Expr gets some sort of id?

Some examples:

// would expr1 and expr2 have the same id?
let expr1 = col("foo");
let expr2 = col("foo");
// would expr1 and expr2 have the same id?
let expr1 = col("foo");
let expr2 = expr1.clone()

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Dec 1, 2023

Actually I think I was off the mark on what ExprId is intended to do, it seems it would be more useful if there were a new LogicalExpr enum such as AttributeReference, which would refer to an expr from the parent plan by ExprId

Like given a logical plan:

Projection: a.int_col, b.double_col, CAST(a.date_string_col AS Utf8)
  Inner Join: a.int_col = b.int_col
    SubqueryAlias: a
      Projection: alltypes_plain.int_col, alltypes_plain.date_string_col
        Filter: alltypes_plain.id > Int32(1)
          TableScan: alltypes_plain projection=[id, int_col, date_string_col], partial_filters=[alltypes_plain.id > Int32(1)]
    SubqueryAlias: b
      Projection: alltypes_plain.int_col, alltypes_plain.double_col
        Filter: CAST(alltypes_plain.tinyint_col AS Float64) < alltypes_plain.double_col
          TableScan: alltypes_plain projection=[tinyint_col, int_col, double_col], partial_filters=[CAST(alltypes_plain.tinyint_col AS Float64) < alltypes_plain.double_col]

That top level projection has a.int_col as a Column for example, which when turned into physical plan needs to search the parent schema by name

https://github.com/apache/arrow-datafusion/blob/a6e6d3fab083839239ef81cf3a3546dd8929a541/datafusion/core/src/physical_planner.rs#L879-L891

Whereas with exprid's, it could be possible for a.int_col to be an AttributeReference which references the parent expr list to point to which expr it references by id.

And I think each new expr would have a new ID.

Honestly I could be way off the mark here on the usages/benefits of exprid 😅

It's just something I was thinking about, especially in relation to how verbose it can be to check if columns are the same when taking into account table, schema and catalog parts of the identifier for a column

So instead of having to find the original column of a projected column in a logical plan via name during logical optimization/physical planning, could have that done once off in an analyzer rule pass then afterwards use exprids

@alamb
Copy link
Contributor

alamb commented Dec 3, 2023

What you describe seems similar to the usize index in physical_plan Column references. And as I recall @houqp added exactly to handle this case of ambiguous schema 🤔

I believe Postgres has a similar way of addressing fields (it does so by index rather than column name) in its logical exprs.

The downside I remember from working wit postgres was that there were many potential hard to track down issues when the indexes got messed up.

It might be interesting to see what this would look like in DataFusion 🤔

@tv42
Copy link
Contributor

tv42 commented Dec 8, 2023

Related to this, SQLite and Postgres allow duplicate column names in results. From SQLite tests:

SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1

Datafusion errors out with

query failed: Error during planning: Projections require unique expression names but the expression "COUNT(*) AS col1" at position 0 and "Int64(-92) - Int64(-47) AS col1" at position 1 have the same name. Consider aliasing ("AS") one of them.

Maybe after/as part of implementing this, Datafusion should relax that restriction?

@alamb
Copy link
Contributor

alamb commented Dec 9, 2023

Maybe after/as part of implementing this, Datafusion should relax that restriction?

One challenge is that most of the arrow-rs functionality in RecordBatch, for example, assumes unique column names (find_by_name for example). I don't think the Arrow spec itself restricts names to be unique, but I think the assumption may run pretty deep in the implementations

@tv42
Copy link
Contributor

tv42 commented Dec 11, 2023

I assume you mean column_by_name, https://docs.rs/datafusion/latest/datafusion/common/arrow/record_batch/struct.RecordBatch.html#method.column_by_name?

Ecosystem survey:

  • SQLite picks the first matching column:

    sqlite> select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1);
    1
    
  • Postgres errors at column access time if there are duplicates:

    postgres=# select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1) as foo;
    ERROR:  column reference "col1" is ambiguous
    LINE 1: select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47...
                   ^
    

@tv42
Copy link
Contributor

tv42 commented Apr 22, 2024

This might be a duplicate of #6543?

@alamb
Copy link
Contributor

alamb commented Apr 22, 2024

@tv42 I agree it is certainly related

@findepi
Copy link
Member

findepi commented Nov 18, 2024

This would be great item in the broader scope of #12723, which intents to make DataFusion Logical Plans be state of the art.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants