[FEAT] Filter predicates in SQL join #3371

Merged 2 commits on Nov 21, 2024


kevinzwang
Member

@kevinzwang kevinzwang commented Nov 20, 2024

Adds support for filter predicates in the ON clause of a SQL join, for example:

SELECT * FROM a JOIN b ON a.x = b.x AND a.y > 0

Enables TPC-H q13
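
To illustrate the idea, here is a minimal sketch of how an ON clause such as `a.x = b.x AND a.y > 0` could be split into equi-join keys and per-side filters. It is not the code from this PR: `Expr`, `JoinCondition`, `col`, and `split_join_condition` are hypothetical names made up for the example, and the sketch assumes an inner join, where filtering one side before joining gives the same result.

```rust
// Hypothetical toy types for illustration; these are not Daft's planner types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Col { table: String, name: String },
    Lit(i64),
    Eq(Box<Expr>, Box<Expr>),
    Gt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Result of splitting an ON clause (assumes an inner join).
#[derive(Debug, Default, PartialEq)]
struct JoinCondition {
    left_keys: Vec<String>,
    right_keys: Vec<String>,
    left_filters: Vec<Expr>,
    right_filters: Vec<Expr>,
}

/// Convenience constructor for a table-qualified column such as `a.x`.
fn col(table: &str, name: &str) -> Expr {
    Expr::Col { table: table.to_string(), name: name.to_string() }
}

/// Collects the distinct table names referenced by `expr`.
fn tables_in(expr: &Expr, out: &mut Vec<String>) {
    match expr {
        Expr::Col { table, .. } => {
            if !out.contains(table) {
                out.push(table.clone());
            }
        }
        Expr::Lit(_) => {}
        Expr::Eq(l, r) | Expr::Gt(l, r) | Expr::And(l, r) => {
            tables_in(l, out);
            tables_in(r, out);
        }
    }
}

/// Files a predicate that touches exactly one side under that side's filters.
fn push_filter(expr: &Expr, left: &str, right: &str, cond: &mut JoinCondition) -> Result<(), String> {
    let mut tables = Vec::new();
    tables_in(expr, &mut tables);
    match tables.as_slice() {
        [t] if t.as_str() == left => cond.left_filters.push(expr.clone()),
        [t] if t.as_str() == right => cond.right_filters.push(expr.clone()),
        _ => return Err(format!("unsupported join predicate: {expr:?}")),
    }
    Ok(())
}

/// Walks the conjunction, routing each conjunct to keys or filters.
fn collect(expr: &Expr, left: &str, right: &str, cond: &mut JoinCondition) -> Result<(), String> {
    match expr {
        Expr::And(l, r) => {
            collect(l, left, right, cond)?;
            collect(r, left, right, cond)
        }
        Expr::Eq(l, r) => {
            if let (Expr::Col { table: tl, name: nl }, Expr::Col { table: tr, name: nr }) = (&**l, &**r) {
                // `left.col = right.col` (in either order) becomes a join key pair.
                if tl.as_str() == left && tr.as_str() == right {
                    cond.left_keys.push(nl.clone());
                    cond.right_keys.push(nr.clone());
                    return Ok(());
                }
                if tl.as_str() == right && tr.as_str() == left {
                    cond.left_keys.push(nr.clone());
                    cond.right_keys.push(nl.clone());
                    return Ok(());
                }
            }
            push_filter(expr, left, right, cond)
        }
        other => push_filter(other, left, right, cond),
    }
}

fn split_join_condition(on: &Expr, left: &str, right: &str) -> Result<JoinCondition, String> {
    let mut cond = JoinCondition::default();
    collect(on, left, right, &mut cond)?;
    Ok(cond)
}
```

For the query above, calling `split_join_condition` with tables `a` and `b` would yield `x` as the key on both sides and `a.y > 0` as a left-side filter. Predicates that reference both tables without being a simple column equality are rejected in this sketch.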

@github-actions github-actions bot added the enhancement New feature or request label Nov 20, 2024

codspeed-hq bot commented Nov 20, 2024

CodSpeed Performance Report

Merging #3371 will not alter performance

Comparing kevin/sql-join-on-filter (c018163) with main (5fee192)

Summary

✅ 17 untouched benchmarks

@kevinzwang kevinzwang marked this pull request as ready for review November 21, 2024 00:12
Comment on lines -764 to +768
- let mut rel = self.new_with_context().plan_relation(&first.relation)?;
+ let mut rel = self.plan_relation(&first.relation)?;
  self.table_map.insert(rel.get_name(), rel.clone());
  for tbl in from_iter {
-     let right = self.new_with_context().plan_relation(&tbl.relation)?;
+     let right = self.plan_relation(&tbl.relation)?;
Member Author

I decided to revert my new_with_context here because SQLPlanner.plan_relation neither mutates the plan nor uses the alias map. Subquery relations are already handled in plan_relation with a new_with_context, so it is redundant here.

Comment on lines +782 to +790
macro_rules! return_non_ident_errors {
    ($e:expr) => {
        if !matches!(
            $e,
            PlannerError::ColumnNotFound { .. } | PlannerError::TableNotFound { .. }
        ) {
            return Err($e);
        }

// only one is fully qualified: `join on x = b.y`
([Ident{value: col_a, ..}], [tbl_b, Ident{value: col_b, ..}]) => {
    if tbl_b.value == right_rel.get_name() {
        (col_a.clone(), col_b.clone())
    } else if tbl_b.value == left_rel.get_name() {
        (col_b.clone(), col_a.clone())
    } else {
        unsupported_sql_err!("Could not determine which table the identifiers belong to")
    }
};
Member Author

I use a macro here because a function would have to own the error object to return it, and I would like to use the error in later parts of the code if it is not returned.
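
The trade-off described here can be sketched as follows. This is a simplified illustration, not the PR's code: the `PlannerError` variants below carry made-up fields, and `Table`, `lookup`, and `resolve_on_either_side` are hypothetical helpers. The point is that a macro can `return` from the enclosing function while only inspecting the error, so the caller still owns the error when the macro falls through.

```rust
// Simplified stand-in for the planner's error type; the fields are assumptions.
#[derive(Debug, PartialEq)]
enum PlannerError {
    ColumnNotFound { name: String },
    TableNotFound { name: String },
    Unsupported { message: String },
}

// Returns early from the *enclosing* function unless the error is an
// identifier-resolution error. `matches!` does not move `$e`, and the move
// in `return Err($e)` happens only on the path that leaves the function.
macro_rules! return_non_ident_errors {
    ($e:expr) => {
        if !matches!(
            $e,
            PlannerError::ColumnNotFound { .. } | PlannerError::TableNotFound { .. }
        ) {
            return Err($e);
        }
    };
}

// Hypothetical table description used only by this example.
struct Table<'a> {
    name: &'a str,
    columns: &'a [&'a str],
}

// Hypothetical lookup: resolves `name` against one table.
fn lookup(table: &Table<'_>, name: &str) -> Result<String, PlannerError> {
    if name.is_empty() {
        return Err(PlannerError::Unsupported { message: "empty identifier".to_string() });
    }
    if table.columns.iter().any(|c| *c == name) {
        Ok(format!("{}.{}", table.name, name))
    } else {
        Err(PlannerError::ColumnNotFound { name: name.to_string() })
    }
}

// Tries the left table first, then the right one.
fn resolve_on_either_side(name: &str, left: &Table<'_>, right: &Table<'_>) -> Result<String, PlannerError> {
    match lookup(left, name) {
        Ok(found) => Ok(found),
        Err(left_err) => {
            // Non-identifier errors are returned immediately.
            return_non_ident_errors!(left_err);
            // `left_err` is still owned here, so it can be reported if the
            // fallback lookup on the right side also fails.
            lookup(right, name).map_err(|_| left_err)
        }
    }
}
```

A plain function taking the error by value could not both return it and leave it available to the caller; it would have to hand the error back, for instance via a `Result` wrapper, which is the extra plumbing the macro avoids.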

@universalmind303
Contributor

universalmind303 commented Nov 21, 2024

@kevinzwang wouldn't it be better to handle this directly in the logical plan? That would allow us to support non-equi joins in the DataFrame API as well.

For example:

df1.join(df2, on=((df1["a"] == df2["a"]) & (df2["b"] > 0)))


codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 87.80488% with 15 lines in your changes missing coverage. Please review.

Project coverage is 77.45%. Comparing base (b6695eb) to head (c018163).
Report is 4 commits behind head on main.

Files with missing lines       Patch %   Lines
src/daft-sql/src/planner.rs    87.70%    15 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3371      +/-   ##
==========================================
+ Coverage   77.39%   77.45%   +0.05%     
==========================================
  Files         678      678              
  Lines       83300    83278      -22     
==========================================
+ Hits        64469    64501      +32     
+ Misses      18831    18777      -54     
Files with missing lines       Coverage Δ
src/daft-sql/src/lib.rs        100.00% <100.00%> (ø)
src/daft-sql/src/planner.rs    71.81% <87.70%> (+1.17%) ⬆️

... and 3 files with indirect coverage changes


graphite-app bot commented Nov 21, 2024

Graphite Automations

"Request reviewers once CI passes" took an action on this PR • (11/21/24)

1 reviewer was added to this PR based on Andrew Gazelka's automation.

@kevinzwang
Member Author

kevinzwang commented Nov 21, 2024

> @kevinzwang wouldn't it be better to handle this directly in the logical plan? That would allow us to support non-equi joins in the DataFrame API as well.
>
> For example:
>
> df1.join(df2, on=((df1["a"] == df2["a"]) & (df2["b"] > 0)))

I agree. However, we would potentially need to have a concept of table-associated columns in our logical plan as well as make changes to our join op struct.

I think we should definitely do it in the future, but I didn't want to broaden the scope of this PR. Additionally, I think we would still need to have SQL-specific logic to identify the join keys for each side in a query, so a lot of the work in this PR is still relevant.

Let me know what you think!

@universalmind303
Copy link
Contributor

> > @kevinzwang wouldn't it be better to handle this directly in the logical plan? That would allow us to support non-equi joins in the DataFrame API as well.
> >
> > For example:
> >
> > df1.join(df2, on=((df1["a"] == df2["a"]) & (df2["b"] > 0)))
>
> I agree. However, we would potentially need to have a concept of table-associated columns in our logical plan as well as make changes to our join op struct.
>
> I think we should definitely do it in the future, but I didn't want to broaden the scope of this PR. Additionally, I think we would still need to have SQL-specific logic to identify the join keys for each side in a query, so a lot of the work in this PR is still relevant.
>
> Let me know what you think!

> I think we would still need to have SQL-specific logic to identify the join keys for each side in a query, so a lot of the work in this PR is still relevant.

That sounds good for now. I think an interesting DataFrame case is df1.join(df2, on=((df1["a"] == df2["a"]) & (df2["b"] > 0))). How would the DataFrame side identify df1.a vs df2.a? Wouldn't we need similar logic to handle this case?

Additionally, how would you express this using the DSL col syntax instead of bracket notation: col("df1.a") or df1.col("a")?

> I think we should definitely do it in the future, but I didn't want to broaden the scope of this PR.

I'll open up an issue for dataframe non-equi joins just so we don't lose this!

@universalmind303 universalmind303 merged commit 2c0f3cd into main Nov 21, 2024
46 checks passed
@universalmind303 universalmind303 deleted the kevin/sql-join-on-filter branch November 21, 2024 02:21
@kevinzwang
Member Author

> That sounds good for now. I think an interesting DataFrame case is df1.join(df2, on=((df1["a"] == df2["a"]) & (df2["b"] > 0))). How would the DataFrame side identify df1.a vs df2.a? Wouldn't we need similar logic to handle this case?

DataFusion has a concept of a table reference in a column expression, and I think we could do something similar if you use df["a"] notation.
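
As a rough sketch of what a table reference in a column expression could look like, the example below attaches an optional table qualifier to each column and uses it to decide which side of a join the column belongs to. This is neither DataFusion's API nor Daft's: `ColumnRef`, `Side`, `Relation`, and `resolve_side` are hypothetical names, and a qualified reference is trusted without checking that the column exists in that table.

```rust
/// A column with an optional table qualifier, e.g. `df1.a` or just `a`.
#[derive(Debug, Clone, PartialEq)]
struct ColumnRef {
    table: Option<String>,
    name: String,
}

impl ColumnRef {
    fn qualified(table: &str, name: &str) -> Self {
        ColumnRef { table: Some(table.to_string()), name: name.to_string() }
    }

    fn bare(name: &str) -> Self {
        ColumnRef { table: None, name: name.to_string() }
    }
}

#[derive(Debug, PartialEq)]
enum Side {
    Left,
    Right,
}

/// Hypothetical description of one join input.
struct Relation<'a> {
    name: &'a str,
    columns: &'a [&'a str],
}

impl Relation<'_> {
    fn has_column(&self, name: &str) -> bool {
        self.columns.iter().any(|c| *c == name)
    }
}

/// Decides which join input a column belongs to.
fn resolve_side(col: &ColumnRef, left: &Relation<'_>, right: &Relation<'_>) -> Result<Side, String> {
    match col.table.as_deref() {
        // A qualifier settles the question directly.
        Some(t) if t == left.name => Ok(Side::Left),
        Some(t) if t == right.name => Ok(Side::Right),
        Some(t) => Err(format!("unknown table `{t}`")),
        // Without a qualifier, the column must exist on exactly one side.
        None => match (left.has_column(&col.name), right.has_column(&col.name)) {
            (true, false) => Ok(Side::Left),
            (false, true) => Ok(Side::Right),
            (true, true) => Err(format!("column `{}` is ambiguous", col.name)),
            (false, false) => Err(format!("column `{}` not found", col.name)),
        },
    }
}
```

With inputs `df1(a, c)` and `df2(a, b)`, the qualified reference `df2.a` resolves to the right side, the bare `b` also resolves to the right side, and the bare `a` is reported as ambiguous, which is the case raised in the question above.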
