[FEAT]: Support .clip function #3136

Merged
merged 12 commits into from
Dec 2, 2024

Conversation

@conradsoon conradsoon commented Oct 28, 2024

Closes #1907.

@github-actions github-actions bot added the enhancement New feature or request label Oct 28, 2024
@conradsoon conradsoon marked this pull request as draft October 28, 2024 13:42
Contributor Author

conradsoon commented Oct 28, 2024

Hey @colin-ho, I've made a rough draft of the PR (not complete yet: still need to add tests), functionality seems correct though.

Some things I'd like to ask for direction on:

  • I've actually added binary_min and binary_max as functions as well and expressed the clip in terms of these functions. The binary_min and binary_max use Rust's native min and max functions. Does this approach make sense, or should I make clip use Rust's native clamp function?
  • What kind of behaviour do we want when max < min in the case of .clip? I've followed numpy's implementation (and therefore semantics) of having it just result in the array being entirely max, but it seems Rust's native clamp throws an error instead?
  • Should I keep the exposed names as binary_min and binary_max? Or should I follow numpy and keep it as min and max (even though this meaning is kind of overloaded)?
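
The second question can be made concrete with a small sketch using only the Rust standard library. The helper name `min_then_max_clip` is hypothetical and is not the PR's code; it only illustrates how a lower-bound-then-upper-bound composition behaves on inverted bounds compared with `clamp`, which is documented to panic when `min > max`.

```rust
use std::panic;

/// Hypothetical helper: apply the lower bound, then the upper bound.
/// When lo > hi, every input collapses to hi.
fn min_then_max_clip(v: i32, lo: i32, hi: i32) -> i32 {
    v.max(lo).min(hi)
}

fn main() {
    // Well-ordered bounds: both approaches agree.
    assert_eq!(min_then_max_clip(7, 0, 5), 5);
    assert_eq!(7_i32.clamp(0, 5), 5);

    // Inverted bounds (lo = 5, hi = 2): the composed version returns hi for every input.
    for v in [-3, 3, 9] {
        assert_eq!(min_then_max_clip(v, 5, 2), 2);
    }

    // The standard library's clamp panics on inverted bounds instead.
    let result = panic::catch_unwind(|| 3_i32.clamp(5, 2));
    assert!(result.is_err());
}
```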

codspeed-hq bot commented Oct 28, 2024

CodSpeed Performance Report

Merging #3136 will degrade performances by 21.7%

Comparing conradsoon:feat-clip (ee00546) with main (b5f60e0)

Summary

❌ 1 regressions
✅ 16 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | conradsoon:feat-clip | Change |
|---|---|---|---|
| test_iter_rows_first_row[100 Small Files] | 298.1 ms | 380.7 ms | -21.7% |

@colin-ho
Contributor

> Hey @colin-ho, I've made a rough draft of the PR (not complete yet: still need to add tests), functionality seems correct though.
>
> Some things I'd like to ask for direction on:
>
>   • I've actually added binary_min and binary_max as functions as well and expressed the clip in terms of these functions. The binary_min and binary_max use Rust's native min and max functions. Does this approach make sense, or should I make clip use Rust's native clamp function?
>   • What kind of behaviour do we want when max < min in the case of .clip? I've followed numpy's implementation (and therefore semantics) of having it just result in the array being entirely max, but it seems Rust's native clamp throws an error instead?
>   • Should I keep the exposed names as binary_min and binary_max? Or should I follow numpy and keep it as min and max (even though this meaning is kind of overloaded)?

  • Let's use clamp for simplicity + performance. Performing clamp in a single pass potentially elides the greater-than check, whereas doing min then max will always do both the less-than and greater-than checks for each value.
  • Throw an error if max < min.
  • Ideally we should just expose a single clip expression, but allow flexibility for the user to choose whether to clip with only an upper bound, only a lower bound, or both (i.e. if the upper bound is None, we only apply the lower bound). Also, the num_traits crate has clamp, clamp_min, and clamp_max convenience functions that work for PartialOrd.
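
A minimal sketch of the three points above, written against the standard library only so it stays self-contained. The function name `clip` and the `String` error type are assumptions for illustration; the actual PR uses num_traits and Daft's own error type.

```rust
/// Hypothetical single-value clip: a `None` bound leaves that side open.
/// Returns an error if both bounds are present and lower > upper.
fn clip<T: PartialOrd + Copy>(value: T, lower: Option<T>, upper: Option<T>) -> Result<T, String> {
    match (lower, upper) {
        (Some(lo), Some(hi)) => {
            if lo > hi {
                return Err("lower bound is greater than upper bound".to_string());
            }
            // Single pass: the upper check is skipped once the lower bound applies.
            if value < lo {
                Ok(lo)
            } else if value > hi {
                Ok(hi)
            } else {
                Ok(value)
            }
        }
        (Some(lo), None) => Ok(if value < lo { lo } else { value }),
        (None, Some(hi)) => Ok(if value > hi { hi } else { value }),
        (None, None) => Ok(value),
    }
}

fn main() {
    assert_eq!(clip(12, Some(0), Some(10)), Ok(10));
    assert_eq!(clip(-4.5, Some(0.0), None), Ok(0.0));
    assert_eq!(clip(12, None, None), Ok(12));
    assert!(clip(1, Some(5), Some(2)).is_err());
}
```

Returning a `Result` rather than panicking keeps the error recoverable for the caller, which is one way to read "throw an error" here.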

Contributor Author

conradsoon commented Nov 1, 2024

Hey @colin-ho, have made the requested changes:

  • Now we explicitly throw an error if we try to .clip with a max < min.
  • We perform clamp in a single-pass now, rather than calling max followed by min.
  • Passing None as one of the arguments (or having a null value in one of the rows) results in not bounding for the relevant side.
  • Cleaned up some of the typing logic with num_traits (thanks for the recommendation).

Could I ask for your thoughts on these questions:

  • Should we also support .clip for any datatype whose physical type is clampable (e.g. DateTime, Timestamp)? Or should this be on a case-by-case basis (if so, which types do you think make sense to support?)
  • Currently, the actual kernel that I use to .clip rows by has some pattern-matching logic where I check the nullity of the min or max bound. I did it this way to support cases where we might want to call .clip with entire columns instead of just single-values (and hence need to support selective bounding depending on row values). Are there any performance concerns to doing this way + is there a better way?
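
The row-wise pattern matching described in the second question could look roughly like the sketch below. It models nullable columns as slices of `Option<i64>`; the name `clip_rows` and the `String` error are hypothetical, and real Daft arrays are Arrow-backed rather than `Vec`-backed.

```rust
/// Hypothetical row-wise clip over three equal-length nullable columns.
/// A null value stays null; a null bound leaves that side open for that row.
fn clip_rows(
    values: &[Option<i64>],
    lower: &[Option<i64>],
    upper: &[Option<i64>],
) -> Result<Vec<Option<i64>>, String> {
    if values.len() != lower.len() || values.len() != upper.len() {
        return Err("columns must have equal length".to_string());
    }
    values
        .iter()
        .zip(lower)
        .zip(upper)
        .map(|((v, lo), hi)| match (*v, *lo, *hi) {
            (None, _, _) => Ok(None),
            (Some(_), Some(lo), Some(hi)) if lo > hi => {
                Err(format!("lower bound {lo} is greater than upper bound {hi}"))
            }
            (Some(v), Some(lo), Some(hi)) => Ok(Some(v.clamp(lo, hi))),
            (Some(v), Some(lo), None) => Ok(Some(v.max(lo))),
            (Some(v), None, Some(hi)) => Ok(Some(v.min(hi))),
            (Some(v), None, None) => Ok(Some(v)),
        })
        .collect()
}

fn main() {
    let values = [Some(5), None, Some(-3), Some(40)];
    let lower = [Some(0), Some(0), None, Some(0)];
    let upper = [Some(3), Some(3), Some(10), None];
    let clipped = clip_rows(&values, &lower, &upper).unwrap();
    assert_eq!(clipped, vec![Some(3), None, Some(-3), Some(40)]);
}
```

Every row pays for a three-way match in this form, which is the cost the question is asking about.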

@conradsoon conradsoon changed the title [FEAT]: binary_min, binary_max and clip Series functions [FEAT]: Support .clip function Nov 1, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 3, 2024
Contributor

colin-ho commented Nov 3, 2024

> Hey @colin-ho, have made the requested changes:
>
>   • Now we explicitly throw an error if we try to .clip with a max < min.
>   • We perform clamp in a single-pass now, rather than calling max followed by min.
>   • Passing None as one of the arguments (or having a null value in one of the rows) results in not bounding for the relevant side.
>   • Cleaned up some of the typing logic with num_traits (thanks for the recommendation).
>
> Could I ask for your thoughts on these questions:
>
>   • Should we also support .clip for any datatype whose physical type is clampable (e.g. DateTime, Timestamp)? Or should this be on a case-by-case basis (if so, which types do you think make sense to support?)
>   • Currently, the actual kernel that I use to .clip rows by has some pattern-matching logic where I check the nullity of the min or max bound. I did it this way to support cases where we might want to call .clip with entire columns instead of just single-values (and hence need to support selective bounding depending on row values). Are there any performance concerns to doing this way + is there a better way?

Let's stick with just numeric types for this PR.

Comment on lines 8 to 23
fn clamp_helper<T: PartialOrd + Copy>(
    value: Option<&T>,
    left_bound: Option<&T>,
    right_bound: Option<&T>,
) -> Option<T> {
    match (value, left_bound, right_bound) {
        (None, _, _) => None,
        (Some(v), Some(l), Some(r)) => {
            assert!(l <= r, "Left bound is greater than right bound");
            Some(clamp(*v, *l, *r))
        }
        (Some(v), Some(l), None) => Some(clamp_min(*v, *l)),
        (Some(v), None, Some(r)) => Some(clamp_max(*v, *r)),
        (Some(v), None, None) => Some(*v),
    }
}
Contributor

The key observation we can leverage here for better performance is that the result of the clamp is None if the original value is None. Therefore, instead of using as_arrow.iter(), you can use as_arrow.values_iter(), which returns an iterator over all the values, ignoring the validity. This is fine because we slap the validity of the original array onto the result anyway. The very small benefit of this is that it reduces the number of match branches, I think only by 1 or so.

Unfortunately we can't do this for left and right though, because we need to account for their validity.

But in a case like (array_size, 1, rbound_size) where the single left_bound is not None, you only need one validity check per row, namely for the right_bound (because you are using values_iter for the array, and your left bound is a non-null scalar).

Lastly, and probably the most important, in the case of (_, 1, 1) you can probably do something like

let left = left_bound.get(0);
let right = right_bound.get(0);
if let Some(left) = left
    && let Some(right) = right
{
    self.apply(|v| clamp(v, left, right))
} else if let Some(left) = left {
    self.apply(|v| clamp_min(v, left))
} else if let Some(right) = right {
    self.apply(|v| clamp_max(v, right))
} else {
    Ok(Self::full_null(self.name(), self.data_type(), self.len()))
}
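
The two ideas in this comment (iterate over raw values and reuse the original validity, and pick the clamping closure once when both bounds are scalars) can be sketched without Arrow. `NullableColumn` and `clip_scalar` are hypothetical stand-ins, and the sketch assumes, as described earlier in the thread, that a missing bound leaves that side unbounded.

```rust
/// Toy nullable column: values are stored densely, validity is kept separately.
#[derive(Debug, PartialEq)]
struct NullableColumn {
    values: Vec<i64>,
    validity: Vec<bool>, // true = non-null
}

/// Hypothetical scalar-bound clip: the match runs once, not once per row.
fn clip_scalar(
    col: &NullableColumn,
    lower: Option<i64>,
    upper: Option<i64>,
) -> Result<NullableColumn, String> {
    let values: Vec<i64> = match (lower, upper) {
        (Some(lo), Some(hi)) => {
            if lo > hi {
                return Err("lower bound is greater than upper bound".to_string());
            }
            col.values.iter().map(|&v| v.clamp(lo, hi)).collect()
        }
        (Some(lo), None) => col.values.iter().map(|&v| v.max(lo)).collect(),
        (None, Some(hi)) => col.values.iter().map(|&v| v.min(hi)).collect(),
        // Assumption: no bounds means the values pass through unchanged.
        (None, None) => col.values.clone(),
    };
    // Slots behind null entries are clamped too, but the copied validity masks them.
    Ok(NullableColumn { values, validity: col.validity.clone() })
}

fn main() {
    let col = NullableColumn {
        values: vec![5, 0, -7, 20],
        validity: vec![true, false, true, true],
    };
    let clipped = clip_scalar(&col, Some(0), Some(10)).unwrap();
    assert_eq!(clipped.values, vec![5, 0, 0, 10]);
    assert_eq!(clipped.validity, col.validity);
}
```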

Contributor Author

Makes sense.

I've removed the clamp_helper functions and rewrote the function that I pass into the map to reduce the number of unneeded match arms for the various cases. Maybe a macro would be good, but I think this is clear enough for now.

Contributor

@colin-ho colin-ho left a comment

Looking good so far!

Comment on lines 38 to 58
if !array_field.dtype.is_numeric() {
    return Err(DaftError::TypeError(format!(
        "Expected array input to be numeric, got {}",
        array_field.dtype
    )));
}

// Check if min_field and max_field are numeric or null
if !(min_field.dtype.is_numeric() || min_field.dtype == DataType::Null) {
    return Err(DaftError::TypeError(format!(
        "Expected min input to be numeric or null, got {}",
        min_field.dtype
    )));
}

if !(max_field.dtype.is_numeric() || max_field.dtype == DataType::Null) {
    return Err(DaftError::TypeError(format!(
        "Expected max input to be numeric or null, got {}",
        max_field.dtype
    )));
}
Contributor

These should be consolidated in InferDataType instead.

Contributor Author

Moved the validation logic to clip_op, now it throws an error if any of these conditions are violated.
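
As a rough illustration of consolidating these checks into a single type-inference step, the sketch below uses a hypothetical `DType` enum and `infer_clip_type` function. It does not reproduce Daft's InferDataType or clip_op, and it simply returns the array's type instead of modelling numeric promotion.

```rust
/// Hypothetical, heavily reduced data type enum for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DType {
    Int64,
    Float64,
    Utf8,
    Null,
}

impl DType {
    fn is_numeric(self) -> bool {
        matches!(self, DType::Int64 | DType::Float64)
    }
}

/// Validate all three clip inputs in one place and return the output type.
fn infer_clip_type(array: DType, min: DType, max: DType) -> Result<DType, String> {
    if !array.is_numeric() {
        return Err(format!("Expected array input to be numeric, got {array:?}"));
    }
    for (name, dtype) in [("min", min), ("max", max)] {
        if !(dtype.is_numeric() || dtype == DType::Null) {
            return Err(format!("Expected {name} input to be numeric or null, got {dtype:?}"));
        }
    }
    // Assumption for this sketch: the output keeps the array's type.
    Ok(array)
}

fn main() {
    assert_eq!(infer_clip_type(DType::Int64, DType::Null, DType::Int64), Ok(DType::Int64));
    assert!(infer_clip_type(DType::Utf8, DType::Int64, DType::Int64).is_err());
    assert!(infer_clip_type(DType::Float64, DType::Utf8, DType::Null).is_err());
}
```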

@@ -623,6 +623,19 @@ def floor(self) -> Expression:
        expr = native.floor(self._expr)
        return Expression._from_pyexpr(expr)

    def clip(self, min: Expression, max: Expression) -> Expression:
Contributor

Allow Expression | None = None as the arguments instead

}
}

macro_rules! create_data_array {
Contributor

I don't see much benefit in this macro, since the amount of lines covered is pretty minimal.

.then(|| create_null_series(max.name()))
.unwrap_or_else(|| max.clone());

match &output_type {
Contributor

Try:

output_type if output_type.is_numeric() => {
    with_match_numeric_daft_types!(output_type, |$T| {
        let self_casted = self.cast(output_type)?;
        let min_casted = min.cast(output_type)?;
        let max_casted = max.cast(output_type)?;

        let self_downcasted = self_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
        let min_downcasted = min_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
        let max_downcasted = max_casted.downcast::<<$T as DaftDataType>::ArrayType>()?;
        Ok(self_downcasted.clip(min_downcasted, max_downcasted)?.into_series())
    })
}

instead, which fits in a little better with our codebase.

Contributor Author

Thanks for the tip! Changed it to use this macro instead.

codecov bot commented Nov 24, 2024

Codecov Report

Attention: Patch coverage is 91.28440% with 19 lines in your changes missing coverage. Please review.

Project coverage is 77.52%. Comparing base (8052de7) to head (ee00546).
Report is 38 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-functions/src/numeric/clip.rs | 75.55% | 11 Missing ⚠️ |
| src/daft-core/src/series/ops/clip.rs | 75.86% | 7 Missing ⚠️ |
| src/daft-core/src/array/ops/clip.rs | 98.95% | 1 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3136      +/-   ##
==========================================
+ Coverage   77.35%   77.52%   +0.17%     
==========================================
  Files         684      690       +6     
  Lines       83627    84699    +1072     
==========================================
+ Hits        64688    65664     +976     
- Misses      18939    19035      +96     
| Files with missing lines | Coverage Δ |
|---|---|
| daft/expressions/expressions.py | 93.31% <100.00%> (-0.07%) ⬇️ |
| daft/series.py | 89.58% <100.00%> (+0.03%) ⬆️ |
| src/daft-core/src/datatypes/infer_datatype.rs | 82.61% <100.00%> (+1.04%) ⬆️ |
| src/daft-core/src/python/series.rs | 94.72% <100.00%> (+0.02%) ⬆️ |
| src/daft-core/src/series/ops/mod.rs | 100.00% <ø> (ø) |
| src/daft-functions/src/numeric/mod.rs | 83.33% <100.00%> (+0.21%) ⬆️ |
| src/daft-sql/src/lib.rs | 100.00% <ø> (ø) |
| src/daft-sql/src/modules/numeric.rs | 83.33% <100.00%> (+0.59%) ⬆️ |
| src/daft-core/src/array/ops/clip.rs | 98.95% <98.95%> (ø) |
| src/daft-core/src/series/ops/clip.rs | 75.86% <75.86%> (ø) |
... and 1 more

... and 67 files with indirect coverage changes

@conradsoon conradsoon marked this pull request as ready for review November 24, 2024 15:55
Contributor Author

conradsoon commented Nov 24, 2024

hey @colin-ho, sorry for taking a while (finals q_q) have made the requested changes!

@colin-ho colin-ho self-requested a review December 2, 2024 19:34
Contributor

@colin-ho colin-ho left a comment

Awesome, I just updated with some simple clean ups. Looks good to me, thanks @conradsoon !

@colin-ho colin-ho merged commit acb8118 into Eventual-Inc:main Dec 2, 2024
41 of 42 checks passed
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

[EXPRESSIONS] .clip
2 participants