[FEAT] Filter predicates in SQL join #3371

kevinzwang · 2024-11-20T23:50:04Z

Adds support for things like:

SELECT * FROM a JOIN b ON a.x = b.x AND a.y > 0

Enables TPC-H q13
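
One plausible way to treat an ON clause like the one above is to split it into equi-join keys and per-side filter predicates. The following is a minimal sketch of that idea, not Daft's planner code; the `Col`, `Lit`, `BinOp`, and `classify` names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    table: str
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class BinOp:
    op: str  # "=", ">", "AND", ...
    left: object
    right: object


def split_conjuncts(expr):
    """Flatten nested ANDs into a flat list of conjuncts."""
    if isinstance(expr, BinOp) and expr.op == "AND":
        return split_conjuncts(expr.left) + split_conjuncts(expr.right)
    return [expr]


def tables_in(expr):
    """Collect the names of all tables referenced by an expression."""
    if isinstance(expr, Col):
        return {expr.table}
    if isinstance(expr, BinOp):
        return tables_in(expr.left) | tables_in(expr.right)
    return set()


def classify(on_expr, left, right):
    """Return (join_keys, left_filters, right_filters) for an ON clause."""
    keys, left_filters, right_filters = [], [], []
    for c in split_conjuncts(on_expr):
        is_equi_key = (
            isinstance(c, BinOp)
            and c.op == "="
            and isinstance(c.left, Col)
            and isinstance(c.right, Col)
            and {c.left.table, c.right.table} == {left, right}
        )
        if is_equi_key:
            # Orient the pair as (left key, right key).
            l, r = (c.left, c.right) if c.left.table == left else (c.right, c.left)
            keys.append((l.name, r.name))
        elif tables_in(c) == {left}:
            left_filters.append(c)
        elif tables_in(c) == {right}:
            right_filters.append(c)
        else:
            raise ValueError(f"unsupported join predicate: {c}")
    return keys, left_filters, right_filters


# SELECT * FROM a JOIN b ON a.x = b.x AND a.y > 0
on = BinOp(
    "AND",
    BinOp("=", Col("a", "x"), Col("b", "x")),
    BinOp(">", Col("a", "y"), Lit(0)),
)
keys, left_filters, right_filters = classify(on, "a", "b")
```

For an inner join, each single-side predicate can be applied as a filter on that input before joining. For outer joins this is only safe on the null-supplying side (for example the right side of a LEFT JOIN), so a real planner has to take the join type into account.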

codspeed-hq · 2024-11-20T23:59:29Z

CodSpeed Performance Report

Merging #3371 will not alter performance

Comparing kevin/sql-join-on-filter (c018163) with main (5fee192)

Summary

✅ 17 untouched benchmarks

kevinzwang · 2024-11-21T00:14:24Z

src/daft-sql/src/planner.rs

-            let mut rel = self.new_with_context().plan_relation(&first.relation)?;
+            let mut rel = self.plan_relation(&first.relation)?;
            self.table_map.insert(rel.get_name(), rel.clone());
            for tbl in from_iter {
-                let right = self.new_with_context().plan_relation(&tbl.relation)?;
+                let right = self.plan_relation(&tbl.relation)?;


I decided to revert my new_with_context change here because SQLPlanner.plan_relation neither mutates the plan nor uses the alias map. Subquery relations are already handled inside plan_relation with a new_with_context, so the extra call is redundant.

kevinzwang · 2024-11-21T00:18:00Z

src/daft-sql/src/planner.rs

+        macro_rules! return_non_ident_errors {
+            ($e:expr) => {
+                if !matches!(
+                    $e,
+                    PlannerError::ColumnNotFound { .. } | PlannerError::TableNotFound { .. }
+                ) {
+                    return Err($e);
                }
-                // only one is fully qualified: `join on x = b.y`
-                ([Ident{value: col_a, ..}], [tbl_b, Ident{value: col_b, ..}]) => {
-                    if tbl_b.value == right_rel.get_name() {
-                        (col_a.clone(), col_b.clone())
-                    } else if tbl_b.value == left_rel.get_name() {
-                        (col_b.clone(), col_a.clone())
-                    } else {
-                        unsupported_sql_err!("Could not determine which table the identifiers belong to")
-                    }
+            };


I use a macro here because a function would have to own the error object to return it, and I would like to use the error in later parts of the code if it is not returned.

universalmind303 · 2024-11-21T00:25:12Z

@kevinzwang wouldn't it be better to handle this directly in the logical plan? That would allow us to support non-equi joins in the dataframe API as well.

e.g.

df1.join(df2, on=(df1["a"] == df2["a"]) & (df2["b"] > 0))
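
A note on writing such a condition in Python: `&` binds more tightly than `==` and `>`, so each comparison needs its own parentheses. The sketch below uses only the standard-library `ast` module to show how the two spellings parse; it does not call Daft, and the expression strings are just illustrations.

```python
import ast

unparenthesized = 'df1["a"] == df2["a"] & df2["b"] > 0'
parenthesized = '(df1["a"] == df2["a"]) & (df2["b"] > 0)'


def top_level(src):
    """Return the class name of the outermost AST node of an expression."""
    return type(ast.parse(src, mode="eval").body).__name__


# Without parentheses the whole thing is one chained comparison:
#   df1["a"] == (df2["a"] & df2["b"]) > 0
chained = ast.parse(unparenthesized, mode="eval").body
middle_operand = chained.comparators[0]

# With parentheses the outermost node is the `&` between two comparisons.
conjunction = ast.parse(parenthesized, mode="eval").body
```

Here `chained` is an `ast.Compare` with two operators (`==` and `>`), and its middle operand is the `&` expression, which is rarely what the author of such a join condition intends.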

codecov · 2024-11-21T00:33:42Z

Codecov Report

Attention: Patch coverage is 87.80488% with 15 lines in your changes missing coverage. Please review.

Project coverage is 77.45%. Comparing base (b6695eb) to head (c018163).
Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
src/daft-sql/src/planner.rs	87.70%	15 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3371      +/-   ##
==========================================
+ Coverage   77.39%   77.45%   +0.05%     
==========================================
  Files         678      678              
  Lines       83300    83278      -22     
==========================================
+ Hits        64469    64501      +32     
+ Misses      18831    18777      -54

Files with missing lines	Coverage Δ
src/daft-sql/src/lib.rs	`100.00% <100.00%> (ø)`
src/daft-sql/src/planner.rs	`71.81% <87.70%> (+1.17%)`	⬆️

... and 3 files with indirect coverage changes

graphite-app · 2024-11-21T00:35:13Z

Graphite Automations

"Request reviewers once CI passes" took an action on this PR • (11/21/24)

1 reviewer was added to this PR based on Andrew Gazelka's automation.

kevinzwang · 2024-11-21T00:35:21Z

> @kevinzwang wouldn't this be better to handle directly in the logical plan? That would allow us to support non-equi joins in the dataframe API as well.
>
> e.g.
> df1.join(df2, on=(df1["a"] == df2["a"]) & (df2["b"] > 0))

I agree. However, we would potentially need to have a concept of table-associated columns in our logical plan as well as make changes to our join op struct.

I think we should definitely do it in the future, but I didn't want to broaden the scope of this PR. Additionally, I think we would still need to have SQL-specific logic to identify the join keys for each side in a query, so a lot of the work in this PR is still relevant.

Let me know what you think!

universalmind303 · 2024-11-21T01:54:25Z

> I think we would still need to have SQL-specific logic to identify the join keys for each side in a query, so a lot of the work in this PR is still relevant.

That sounds good for now. I think an interesting dataframe case is df1.join(df2, on=(df1["a"] == df2["a"]) & (df2["b"] > 0)). How would the dataframe side distinguish df1.a from df2.a? Wouldn't we need similar logic to handle this case?

Additionally, how would you express this using the DSL col syntax instead of bracket notation: col("df1.a") or df1.col("a")?

> I think we should definitely do it in the future, but I didn't want to broaden the scope of this PR.

I'll open up an issue for dataframe non-equi joins just so we don't lose this!

kevinzwang · 2024-11-21T05:53:58Z

> That sounds good for now. I think an interesting dataframe case is df1.join(df2, on=(df1["a"] == df2["a"]) & (df2["b"] > 0)). How would the dataframe side distinguish df1.a from df2.a? Wouldn't we need similar logic to handle this case?

DataFusion has a concept of a table reference in a column expression, and I think we could do something similar if you use the df["a"] notation.
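
To make the table-reference idea concrete, here is a minimal sketch of a column expression that optionally carries the name of the table it came from. It is loosely inspired by the idea described above and is not the API of DataFusion or Daft; `ColumnRef`, `resolve`, and the `schemas` mapping are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnRef:
    name: str
    table: Optional[str] = None  # None means the column is unqualified


def resolve(col, schemas):
    """Return the table that owns `col`.

    `schemas` maps a table name to the set of column names it contains.
    """
    if col.table is not None:
        # Qualified reference: the named table must actually have the column.
        if col.name not in schemas.get(col.table, set()):
            raise KeyError(f"{col.table}.{col.name} not found")
        return col.table
    # Unqualified reference: exactly one table may contain the column.
    owners = sorted(t for t, cols in schemas.items() if col.name in cols)
    if len(owners) != 1:
        raise KeyError(f"column {col.name!r} is missing or ambiguous: {owners}")
    return owners[0]


schemas = {"df1": {"a"}, "df2": {"a", "b"}}

left_key = resolve(ColumnRef("a", table="df1"), schemas)   # qualified
right_key = resolve(ColumnRef("a", table="df2"), schemas)  # qualified
filter_side = resolve(ColumnRef("b"), schemas)             # unqualified, unique
```

Under this model, df1["a"] would produce a reference qualified with df1, so the two sides of the equality stay distinguishable even though both columns are named a. An unqualified `ColumnRef("a")` would be rejected as ambiguous.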

[FEAT] Filter predicates in SQL join

fb83763

github-actions bot added the enhancement New feature or request label Nov 20, 2024

add and fix tests

c018163

kevinzwang requested a review from universalmind303 November 21, 2024 00:12

kevinzwang marked this pull request as ready for review November 21, 2024 00:12

universalmind303 mentioned this pull request Nov 21, 2024

Dataframe non-equi joins #3380

Open

universalmind303 approved these changes Nov 21, 2024

universalmind303 merged commit 2c0f3cd into main Nov 21, 2024
46 checks passed

universalmind303 deleted the kevin/sql-join-on-filter branch November 21, 2024 02:21
