[FEAT] any_value groupby aggregation #1941
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1941 +/- ##
==========================================
- Coverage 85.40% 83.93% -1.47%
==========================================
Files 55 55
Lines 6221 6132 -89
==========================================
- Hits 5313 5147 -166
- Misses 908 985 +77
@@ -97,6 +97,14 @@ impl Series {
        self.inner.max(groups)
    }

    pub fn any_value(
Zooming out a bit, we shouldn't have to implement this for each of the Array types separately, since this operation only has to work on the validity mask.
You could actually implement it on the Series directly:
let mask = self.validity();
let indices = get_idx_from_bitmap(mask, groups);
self.take(indices)
I think count could also be refactored that way.
What do you think?
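To make the suggestion concrete, here is a minimal, self-contained sketch of the mask-only approach. It does not use the project's real types: the validity mask is modeled as a plain `&[bool]`, groups as lists of row indices, and `any_value_indices` and `take` are hypothetical stand-ins for the `get_idx_from_bitmap` helper and `Series::take` mentioned above.

```rust
/// Pick one row index per group using only the validity mask.
/// `validity[i]` is true when row `i` is non-null; `None` means "no nulls at all".
/// Hypothetical stand-in for the `get_idx_from_bitmap` helper in the comment above.
fn any_value_indices(
    validity: Option<&[bool]>,
    groups: &[Vec<usize>],
    ignore_nulls: bool,
) -> Vec<Option<usize>> {
    groups
        .iter()
        .map(|group| match (validity, ignore_nulls) {
            // Skip null rows: take the first valid row, if the group has one.
            (Some(mask), true) => group.iter().copied().find(|&i| mask[i]),
            // Otherwise any row will do, so take the first one (None if the group is empty).
            _ => group.first().copied(),
        })
        .collect()
}

/// Gather values at the chosen indices; a missing index or a null row yields `None`.
/// Hypothetical stand-in for a type-agnostic `take` on the series.
fn take<T: Clone>(values: &[Option<T>], indices: &[Option<usize>]) -> Vec<Option<T>> {
    indices
        .iter()
        .map(|idx| idx.and_then(|i| values[i].clone()))
        .collect()
}
```

Because the index selection never looks at the values themselves, the same function would serve every array type; only the final gather is type-dependent.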
I implemented this idea for any_value, but the array-level count implementation is used directly in other parts of the code (e.g. for computing mean), so I think it's still useful to consolidate the logic there.
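As an illustration of why a shared count is convenient for mean, here is a rough sketch under simplified assumptions: values are a plain `&[f64]`, the validity mask is a `&[bool]`, and `grouped_count` and `grouped_mean` are hypothetical names, not the project's actual functions.

```rust
/// Count the non-null rows in each group, using only the validity mask.
/// `None` for the mask means the column has no nulls.
fn grouped_count(validity: Option<&[bool]>, groups: &[Vec<usize>]) -> Vec<u64> {
    groups
        .iter()
        .map(|group| match validity {
            Some(mask) => group.iter().filter(|&&i| mask[i]).count() as u64,
            None => group.len() as u64,
        })
        .collect()
}

/// Per-group mean that reuses `grouped_count` for the denominator.
/// Groups with no valid rows produce `None`.
fn grouped_mean(
    values: &[f64],
    validity: Option<&[bool]>,
    groups: &[Vec<usize>],
) -> Vec<Option<f64>> {
    let counts = grouped_count(validity, groups);
    groups
        .iter()
        .zip(counts)
        .map(|(group, n)| {
            if n == 0 {
                return None;
            }
            // Sum only the rows the mask marks as valid.
            let sum: f64 = group
                .iter()
                .filter(|&&i| validity.map_or(true, |mask| mask[i]))
                .map(|&i| values[i])
                .sum();
            Some(sum / n as f64)
        })
        .collect()
}
```

In this sketch the count logic lives in one place and mean simply calls it, which is the kind of consolidation the reply argues for.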
This function is parameterized by ignore_nulls, which, when true, attempts to find a non-null value in each group. However, using this parameter in the aggregation function would require some changes to DataFrame._agg() that I am going to save for later, since these changes will probably no longer be needed once global expressions can be passed into GroupBy operations.

Also in this PR: fixes to the count aggregation function.