-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Support intersect as a DataFrame API #3134
[FEAT] Support intersect as a DataFrame API #3134
Conversation
src/daft-dsl/src/lit.rs
Outdated
StructArray::new(struct_field, values, None).into_series() | ||
} | ||
} | ||
} | ||
|
||
pub fn to_series(&self) -> Series { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous to_series
doesn't work with Struct, as struct has its own field names but to_series
always generate field with "literal".
src/daft-dsl/src/lit.rs
Outdated
#[cfg(feature = "python")] | ||
DataType::Python => { | ||
use pyo3::prelude::*; | ||
Self::Python(PyObjectWrapper(Python::with_gil(|py| py.None()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure whether None
is the right choice for Python type.
CodSpeed Performance ReportMerging #3134 will not alter performanceComparing Summary
|
@kevinzwang @universalmind303 appreciated if you can take a look at this. |
51da893
to
f0ed52b
Compare
daft/expressions/expressions.py
Outdated
@@ -133,6 +134,39 @@ def lit(value: object) -> Expression: | |||
return Expression._from_pyexpr(lit_value) | |||
|
|||
|
|||
def zero_lit(dt: DataType) -> Expression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be useful to expose this to Python side as well.
I can remove this if it's not desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @advancedxy, thank you for working on this! Really appreciate the work that you've done.
I don't have specific comments about the code yet, but from a cursory look at the PR, I don't see the need for the zero_lit
expression. If you take a look at our join functions, we use arrow2's build_multi_array_is_equal
to construct an equality check, and that function takes an argument nulls_equal
.
Daft/src/daft-table/src/ops/joins/hash_join.rs
Lines 55 to 60 in 975c09e
let is_equal = build_multi_array_is_equal( | |
lkeys.columns.as_slice(), | |
rkeys.columns.as_slice(), | |
false, | |
false, | |
)?; |
I believe if we propagate a variable to set that, a null-safe join would automatically work, since the hashes used by the probe are already properly constructed for nulls as well.
Could you give that a try?
Hey @kevinzwang kevin, that was exactly my first thought as well. When I created the original issue, I think we can add null equal safe joins first and then leverage that to support the intersect operation. However, passing the parameter from the python side all the way down to the Rust's physical join plan, it seems it might touch a lot of code and I referenced other(a.k.a Spark) query engine's implementation and noticed that null safe equality could be effective rewrote as
Of course, and taking a step back from here, I think we can add null safe equal in joins in the Rust side first, then leverage that to support this PR's intersection operator and then finish Python side's API. How does that sound to you? |
Yep, that sounds like a good plan. Thank you again for taking this on! |
692704e
to
31af0e7
Compare
Hey @kevinzwang @universalmind303 PTAL at this after #3161 is merged, thanks. |
31af0e7
to
8b9fda5
Compare
8b9fda5
to
82e545d
Compare
The CI failure seems unrelated. Close and re-open to trigger a new CI run. |
82e545d
to
dcddfb5
Compare
Since #3161 is merged, I think this is ready for review. |
pub fn intersect(&self, other: &Self, is_all: bool) -> DaftResult<Self> { | ||
let logical_plan: LogicalPlan = | ||
ops::Intersect::try_new(self.plan.clone(), other.plan.clone(), is_all)? | ||
.to_optimized_join()?; | ||
Ok(self.with_new_plan(logical_plan)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of immediately converting it to a join, I think we should instead defer this until translate
stage.
If we want to have a concept of a logical Intersect
as discussed here, then I don't think we should immediately convert it to a logical join, but only convert it once to a physical join during the logical -> physical translation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should instead defer this until translate stage.
That's definitely an option. And DuckDB does that indeed: https://github.com/duckdb/duckdb/blob/a2dce8b1c9fa6039c82e9a32bfcc4c49b03ca871/src/execution/physical_plan/plan_set_operation.cpp#L53. The problem is like you stated in #3241 (comment), it will make the optimization rules more complex: a lot of rules should be amend to be set-operations aware.
On the other side, Spark doesn't defer this to translate
stage, the intersect/except operators are optimized during the optimization phase, which makes the optimization phase simpler(at least for set-operation handling).
From my perspective, I agree with you that we should convert set operations to join early to avoid complexing optimization phase currently. I just want to decouple the logical into a separate func/struct and make sure it's extensible for long term plan. I can avoid define these operations if that doesn't sound right to you. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, Lets go with what's in here now. We can always revisit it needed 😅
This commit leverages null safe equal support in joins(see #3069 and #3161) to support intersect API.
Partially fixes #3122.