Commit
[FEAT] Improved projection folding. (#1374)
While looking at the projection folding behaviour in the old query planner, I noticed that it left some foldable projections unfolded. For example, (a + 1) -> (a as c) could be folded into (a + 1 as c), but the old query planner does not do that:

```
>>> daft.from_pydict({"a": [1, 2], "b": [3, 4]}).select((col("a")+1)).select(col("a").alias("c")).explain(show_optimized=True)
2023-09-13 19:34:09.494 | INFO | daft.context:runner:87 - Using PyRunner
┌─Projection
│    output=[col(a) AS c]
│    partitioning={'by': None, 'num_partitions': 1, 'scheme': PartitionScheme.Unknown}
│
├──Projection
│    output=[col(a) + lit(1)]
│    partitioning={'by': None, 'num_partitions': 1, 'scheme': PartitionScheme.Unknown}
│
└──InMemoryScan
     output=[col(a), col(b)]
     cache_id='fc55df21728b4bf9bb1d6143559af9d7'
     partitioning={'by': None, 'num_partitions': 1, 'scheme': PartitionScheme.Unknown}
```

The old projection folding rule was:

- A projection P with a child projection C can be folded together if P only references columns that involve no computation in C (no-ops/aliases).

However, some projections with computation in C can also be folded into P. Specifically, they can be safely folded if all columns in C are referenced at most once in P. In that case, the expressions in C can be substituted directly for the column references in P without changing execution semantics, because no computation is duplicated.

This PR implements improved projection folding with this new rule. The new query planner will fold the example above into:

```
* Project: col(a) + lit(1) AS c
| * Output schema = a (Int64)
```

---------

Co-authored-by: Xiayue Charles Lin <[email protected]>
Co-authored-by: Clark Zinzow <[email protected]>
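For illustration only, the folding rule described in this message can be sketched in plain Python. This is not Daft's implementation: the `Col`, `Lit`, `Add`, and `Alias` classes and the `fold` helper are hypothetical stand-ins. The sketch also assumes that an unaliased expression takes its output name from its leftmost column, and that no-op columns (plain references or aliases) may still be referenced any number of times, as under the old rule.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical toy expression tree; these are NOT Daft's classes.
@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Alias:
    child: object
    name: str


def output_name(expr):
    """Output column name (assumption: unaliased exprs use their leftmost column)."""
    if isinstance(expr, (Alias, Col)):
        return expr.name
    if isinstance(expr, Add):
        return output_name(expr.left)
    return "literal"


def strip_alias(expr):
    while isinstance(expr, Alias):
        expr = expr.child
    return expr


def is_noop(expr):
    """True if the expression is a bare column reference, possibly aliased."""
    return isinstance(strip_alias(expr), Col)


def column_refs(expr):
    """Count how many times each column name is referenced in expr."""
    if isinstance(expr, Col):
        return Counter([expr.name])
    if isinstance(expr, Alias):
        return column_refs(expr.child)
    if isinstance(expr, Add):
        return column_refs(expr.left) + column_refs(expr.right)
    return Counter()


def substitute(expr, defs):
    """Replace each column reference with the child expression that defines it."""
    if isinstance(expr, Col):
        return strip_alias(defs[expr.name])
    if isinstance(expr, Alias):
        return Alias(substitute(expr.child, defs), expr.name)
    if isinstance(expr, Add):
        return Add(substitute(expr.left, defs), substitute(expr.right, defs))
    return expr


def fold(parent, child):
    """Fold parent projection P into child projection C, or return None."""
    defs = {output_name(e): e for e in child}
    refs = sum((column_refs(e) for e in parent), Counter())
    # A computed column referenced more than once would be recomputed, so bail out.
    for name, count in refs.items():
        if count > 1 and not is_noop(defs[name]):
            return None
    folded = []
    for e in parent:
        new = substitute(e, defs)
        # Re-alias if substitution changed the output name of the expression.
        if output_name(new) != output_name(e):
            new = Alias(new, output_name(e))
        folded.append(new)
    return folded


child = [Add(Col("a"), Lit(1))]       # select(col("a") + 1)
parent = [Alias(Col("a"), "c")]       # select(col("a").alias("c"))
folded = fold(parent, child)          # corresponds to col(a) + lit(1) AS c
```

Under these assumptions, a parent such as `[Add(Col("a"), Col("a"))]` over the same child is left unfolded, since substituting would evaluate `a + 1` twice.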