[FEAT] Add group_by.map_groups #1825

colin-ho · 2024-01-29T23:48:38Z

Adds the map_groups method, which allows custom aggregation logic using a UDF in a group_by context. Usage: df.group_by('group').map_groups(udf(col('a'), col('b')))

Changes:

Added map_groups API
Added map_groups expression
Implemented map_groups logic
Added tests
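
To make the intended semantics concrete, here is a minimal pure-Python sketch of a map-groups operation. It is not Daft's implementation and does not use the Daft API: `map_groups_sketch`, the list-of-dicts row format, and the `spread` function are hypothetical names chosen for illustration. The sketch assumes the function is called once per group with one list of values per input column, and that the output column takes the name of the first input.

```python
from collections import defaultdict


def map_groups_sketch(rows, key, input_cols, func):
    """Group `rows` by `key`, then call `func` once per group.

    `func` receives one list per input column (that group's values)
    and returns a list of output values for the group.
    """
    # For each group key, keep one bucket of values per input column.
    groups = defaultdict(lambda: [[] for _ in input_cols])
    for row in rows:
        for bucket, col_name in zip(groups[row[key]], input_cols):
            bucket.append(row[col_name])

    out = []
    for group_key, columns in groups.items():
        for value in func(*columns):
            # Assumption: the output column is named after the first input.
            out.append({key: group_key, input_cols[0]: value})
    return out


def spread(a, b):
    # Hypothetical custom aggregation: a single value per group.
    return [max(a) - min(b)]


rows = [
    {"group": "x", "a": 1, "b": 10},
    {"group": "x", "a": 4, "b": 7},
    {"group": "y", "a": 2, "b": 2},
]
result = map_groups_sketch(rows, "group", ["a", "b"], spread)
```

Because the function returns a list, a group may yield one row, several rows, or none, which is what separates this from a fixed aggregation such as Min or Max.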

codecov · 2024-01-29T23:58:56Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6cda37a) 84.98% compared to head (d22622a) 85.62%.
Report is 3 commits behind head on main.

❗ Current head d22622a differs from pull request most recent head 5a8c4c9. Consider uploading reports for the commit 5a8c4c9 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1825      +/-   ##
==========================================
+ Coverage   84.98%   85.62%   +0.64%     
==========================================
  Files          55       55              
  Lines        5734     6088     +354     
==========================================
+ Hits         4873     5213     +340     
- Misses        861      875      +14

Files	Coverage Δ
daft/dataframe/dataframe.py	`88.36% <100.00%> (+0.12%)`	⬆️
daft/logical/builder.py	`89.91% <100.00%> (+0.35%)`	⬆️

... and 10 files with indirect coverage changes

jaychia

Looks pretty good! Let's add some docs as well

jaychia · 2024-01-31T21:08:29Z

daft/dataframe/dataframe.py

+
+    def map_groups(self, udf: Expression) -> "DataFrame":
+        """Apply a user-defined function to each group.
+


Let's add a nice Examples section here as well for documentation?

jaychia · 2024-01-31T21:42:53Z

src/daft-dsl/src/expr.rs

@@ -85,6 +89,7 @@ impl AggExpr {
            | Max(expr)
            | List(expr)
            | Concat(expr) => expr.name(),
+            MapGroups { func: _, inputs } => inputs.first().unwrap().name(),


Interesting... I see no alternative to this logic here but it also feels odd to me that the name of the result expr is just the name of the first expr. I guess this is the same behavior as a normal UDF right now.

Perhaps this is another argument in favor of getting rid of default behavior for expression naming.

colin-ho · 2024-02-02

Added some documentation in the docstring to call out that the column name will default to the name of the first input.

I also realized that users may try to do a .alias() on the first input column to customize the column name, so I made a slight change in latest commit to enable this (added a test for this as well).

will merge if you're good with this!
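
The naming rule discussed in this thread can be sketched in a few lines of Python. This is a toy model, not the code in expr.rs: the `Col` class and `output_name` function are hypothetical stand-ins, and the only behavior taken from the discussion is that the result is named after the first input and that aliasing the first input changes that name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Col:
    """Hypothetical stand-in for a column expression."""

    column: str
    alias_name: Optional[str] = None

    def alias(self, name: str) -> "Col":
        # Return a new expression that carries the requested output name.
        return Col(self.column, name)

    def name(self) -> str:
        # An alias, when present, takes precedence over the column name.
        return self.alias_name if self.alias_name is not None else self.column


def output_name(inputs):
    # Rule from the review thread: use the name of the first input.
    return inputs[0].name()


default_name = output_name([Col("a"), Col("b")])
custom_name = output_name([Col("a").alias("result"), Col("b")])
```

Here `default_name` is "a" and `custom_name` is "result"; aliasing any input other than the first has no effect on the output name in this model.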

jaychia · 2024-01-31T21:49:56Z

src/daft-dsl/src/expr.rs

        }
    }

-    pub fn child(&self) -> ExprRef {
+    pub fn child(&self) -> Vec<ExprRef> {


More accurately now pub fn children(...)?

jaychia · 2024-01-31T21:58:45Z

src/daft-dsl/src/expr.rs

@@ -131,7 +137,8 @@ impl AggExpr {
            | Min(expr)
            | Max(expr)
            | List(expr)
-            | Concat(expr) => expr.clone(),
+            | Concat(expr) => vec![expr.clone()],
+            MapGroups { func: _, inputs } => inputs.iter().map(|e| e.clone().into()).collect(),


Nit: I think you can just do inputs.clone()

colin-ho · 2024-02-01

Hmm, doing inputs.clone() gives an error:

match arms have incompatible types
expected struct `Vec<Arc<expr::Expr>>`
found struct `Vec<expr::Expr>`

I'll just stick to this since the impl for FunctionExpr also does the same

jaychia · 2024-02-01

Oh! Interesting, I didn't realize we weren't using ExprRefs

Makes sense.

src/daft-table/src/ops/agg.rs

initial impl

cbe363e

colin-ho changed the title ~~[FEAT] Add map_groups method~~ [FEAT] Add group_by.map_groups Jan 29, 2024

github-actions bot added the enhancement New feature or request label Jan 29, 2024

refactor lil bit

53d36c2

colin-ho requested review from samster25, clarkzinzow and jaychia January 30, 2024 00:30

jaychia approved these changes Jan 31, 2024


colin-ho added 2 commits February 1, 2024 10:53

pr comments

d22622a

updates docs and test

5a8c4c9

colin-ho merged commit ac9ffcf into main Feb 5, 2024
40 checks passed

colin-ho deleted the colin/map_groups branch February 5, 2024 12:19
