-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT]: Support .clip
function
#3136
Conversation
Hey @colin-ho, I've made a rough draft of the PR (not complete yet: still need to add tests), functionality seems correct though. Some things I'd like to ask for direction on:
|
CodSpeed Performance ReportMerging #3136 will degrade performances by 21.7%Comparing Summary
Benchmarks breakdown
|
|
Hey @colin-ho, have made the requested changes:
Could I ask for your thoughts on these questions:
|
binary_min
, binary_max
and clip
Series functions.clip
function
Let's stick with just numeric types for this PR |
src/daft-core/src/array/ops/clip.rs
Outdated
fn clamp_helper<T: PartialOrd + Copy>( | ||
value: Option<&T>, | ||
left_bound: Option<&T>, | ||
right_bound: Option<&T>, | ||
) -> Option<T> { | ||
match (value, left_bound, right_bound) { | ||
(None, _, _) => None, | ||
(Some(v), Some(l), Some(r)) => { | ||
assert!(l <= r, "Left bound is greater than right bound"); | ||
Some(clamp(*v, *l, *r)) | ||
} | ||
(Some(v), Some(l), None) => Some(clamp_min(*v, *l)), | ||
(Some(v), None, Some(r)) => Some(clamp_max(*v, *r)), | ||
(Some(v), None, None) => Some(*v), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They key observation we can leverage here for some better performance is that the result of the clamp is None if the original value is None. Therefore instead of doing as_arrow.iter()
you can use as_arrow.values_iter()
, which will return an iterator of all the values, ignoring the validity. This is fine because we slap on the validity of the original array anyway. The very small benefit of this is that it will reduce the number of match branches, i think only by 1 or something.
Unfortunately we can't do this for left and right though, because we need to account for their validity.
But in a case like (array_size, 1, rbound_size)
and the single left_bound is not None, you only need 1 validity check per row! i.e. for the right_bound (because you are using values_iter
for the array, and your left bound is a non-null scalar).
Lastly, and probably the most important, in the case of (_, 1, 1)
you can probably do something like
let left = left_bound.get(0);
let right = right_bound.get(0);
if let Some(left) = left
&& let Some(right) = right
{
self.apply(|v| clamp(v, left, right))
} else if let Some(left) = left {
self.apply(|v| clamp_min(v, left))
} else if let Some(right) = right {
self.apply(|v| clamp_max(v, right))
} else {
Ok(Self::full_null(self.name(), self.data_type(), self.len()))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
I've removed the clamp_helper
functions and re-wrote the function that I pass into the map
to reduce the amount of unneeded match
arms for the various cases. Maybe a macro would be good, but I think this is clear enough for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far!
if !array_field.dtype.is_numeric() { | ||
return Err(DaftError::TypeError(format!( | ||
"Expected array input to be numeric, got {}", | ||
array_field.dtype | ||
))); | ||
} | ||
|
||
// Check if min_field and max_field are numeric or null | ||
if !(min_field.dtype.is_numeric() || min_field.dtype == DataType::Null) { | ||
return Err(DaftError::TypeError(format!( | ||
"Expected min input to be numeric or null, got {}", | ||
min_field.dtype | ||
))); | ||
} | ||
|
||
if !(max_field.dtype.is_numeric() || max_field.dtype == DataType::Null) { | ||
return Err(DaftError::TypeError(format!( | ||
"Expected max input to be numeric or null, got {}", | ||
max_field.dtype | ||
))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be consolidated in InferDataType
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the validation logic to clip_op
, now it throws an error if any of these conditions are violated.
daft/expressions/expressions.py
Outdated
@@ -623,6 +623,19 @@ def floor(self) -> Expression: | |||
expr = native.floor(self._expr) | |||
return Expression._from_pyexpr(expr) | |||
|
|||
def clip(self, min: Expression, max: Expression) -> Expression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allow Expression | None = None
as the arguments instead
src/daft-core/src/array/ops/clip.rs
Outdated
} | ||
} | ||
|
||
macro_rules! create_data_array { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see much benefit in this macro, since the amount of lines covered is pretty minimal.
.then(|| create_null_series(max.name())) | ||
.unwrap_or_else(|| max.clone()); | ||
|
||
match &output_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try:
output_type if output_type.is_numeric() => {
with_match_numeric_daft_types!(output_type, |$T| {
let self_casted = self.cast(output_type)?;
let min_casted = min.cast(output_type)?;
let max_casted = max.cast(output_type)?;
let self_downcasted = self_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
let min_downcasted = min_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
let max_downcasted = max_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
Ok(self_downcasted.clip(min_downcasted, max_downcasted)?.into_series())
})
}
instead, which fits in a little better with our codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tip! Changed it to use this macro instead.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3136 +/- ##
==========================================
+ Coverage 77.35% 77.52% +0.17%
==========================================
Files 684 690 +6
Lines 83627 84699 +1072
==========================================
+ Hits 64688 65664 +976
- Misses 18939 19035 +96
|
hey @colin-ho, sorry for taking a while (finals q_q) have made the requested changes! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, I just updated with some simple clean ups. Looks good to me, thanks @conradsoon !
Closes #1907.