-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Add group_by.map_groups #1825
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1825 +/- ##
==========================================
+ Coverage 84.98% 85.62% +0.64%
==========================================
Files 55 55
Lines 5734 6088 +354
==========================================
+ Hits 4873 5213 +340
- Misses 861 875 +14
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good! Let's add some docs as well
|
||
def map_groups(self, udf: Expression) -> "DataFrame": | ||
"""Apply a user-defined function to each group. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a nice Examples section here as well for documentation?
@@ -85,6 +89,7 @@ impl AggExpr { | |||
| Max(expr) | |||
| List(expr) | |||
| Concat(expr) => expr.name(), | |||
MapGroups { func: _, inputs } => inputs.first().unwrap().name(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting... I see no alternative to this logic here but it also feels odd to me that the name of the result expr is just the name of the first expr. I guess this is the same behavior as a normal UDF right now.
Perhaps this is another argument in favor of getting rid of default behavior for expression naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some documentation in the docstring to call out that column name will default to name of first input.
I also realized that users may try to do a .alias()
on the first input column to customize the column name, so I made a slight change in latest commit to enable this (added a test for this as well).
will merge if you're good with this!
src/daft-dsl/src/expr.rs
Outdated
} | ||
} | ||
|
||
pub fn child(&self) -> ExprRef { | ||
pub fn child(&self) -> Vec<ExprRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More accurately now pub fn children(...)
?
@@ -131,7 +137,8 @@ impl AggExpr { | |||
| Min(expr) | |||
| Max(expr) | |||
| List(expr) | |||
| Concat(expr) => expr.clone(), | |||
| Concat(expr) => vec![expr.clone()], | |||
MapGroups { func: _, inputs } => inputs.iter().map(|e| e.clone().into()).collect(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think you can just do inputs.clone()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, doing inputs.clone()
gives an error:
match
arms have incompatible types
expected struct Vec<Arc<expr::Expr>>
found struct `Vecexpr::Expr
I'll just stick to this since the impl for FunctionExpr also does the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Interesting didn't realize we weren't using ExprRefs
Makes sense.
Closes #1630
Added the
map_groups
method which allows for custom aggregation logic using a UDF in a group_by context. Usage:df.group_by('group').map_groups(udf(col('a'), col('b')))
Changes: