Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Add group_by.map_groups #1825

Merged
merged 4 commits into from
Feb 5, 2024
Merged

[FEAT] Add group_by.map_groups #1825

merged 4 commits into from
Feb 5, 2024

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Jan 29, 2024

Closes #1630

Added the map_groups method which allows for custom aggregation logic using a UDF in a group_by context. Usage: df.group_by('group').map_groups(udf(col('a'), col('b')))

Changes:

  • Added map_groups API
  • Added map_groups expression
  • Implemented map_groups logic
  • Added tests

@colin-ho colin-ho changed the title [FEAT] Add map_groups method [FEAT] Add group_by.map_groups Jan 29, 2024
@github-actions github-actions bot added the enhancement New feature or request label Jan 29, 2024
Copy link

codecov bot commented Jan 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6cda37a) 84.98% compared to head (d22622a) 85.62%.
Report is 3 commits behind head on main.

❗ Current head d22622a differs from pull request most recent head 5a8c4c9. Consider uploading reports for the commit 5a8c4c9 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1825      +/-   ##
==========================================
+ Coverage   84.98%   85.62%   +0.64%     
==========================================
  Files          55       55              
  Lines        5734     6088     +354     
==========================================
+ Hits         4873     5213     +340     
- Misses        861      875      +14     
Files Coverage Δ
daft/dataframe/dataframe.py 88.36% <100.00%> (+0.12%) ⬆️
daft/logical/builder.py 89.91% <100.00%> (+0.35%) ⬆️

... and 10 files with indirect coverage changes

Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! Let's add some docs as well


def map_groups(self, udf: Expression) -> "DataFrame":
"""Apply a user-defined function to each group.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a nice Examples section here as well for documentation?

@@ -85,6 +89,7 @@ impl AggExpr {
| Max(expr)
| List(expr)
| Concat(expr) => expr.name(),
MapGroups { func: _, inputs } => inputs.first().unwrap().name(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting... I see no alternative to this logic here but it also feels odd to me that the name of the result expr is just the name of the first expr. I guess this is the same behavior as a normal UDF right now.

Perhaps this is another argument in favor of getting rid of default behavior for expression naming.

Copy link
Contributor Author

@colin-ho colin-ho Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some documentation in the docstring to call out that column name will default to name of first input.

I also realized that users may try to do a .alias() on the first input column to customize the column name, so I made a slight change in latest commit to enable this (added a test for this as well).

will merge if you're good with this!

}
}

pub fn child(&self) -> ExprRef {
pub fn child(&self) -> Vec<ExprRef> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More accurately now pub fn children(...)?

@@ -131,7 +137,8 @@ impl AggExpr {
| Min(expr)
| Max(expr)
| List(expr)
| Concat(expr) => expr.clone(),
| Concat(expr) => vec![expr.clone()],
MapGroups { func: _, inputs } => inputs.iter().map(|e| e.clone().into()).collect(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think you can just do inputs.clone()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, doing inputs.clone() gives an error:

match arms have incompatible types
expected struct Vec<Arc<expr::Expr>>
found struct `Vecexpr::Expr

I'll just stick to this since the impl for FunctionExpr also does the same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Interesting didn't realize we weren't using ExprRefs

Makes sense.

src/daft-table/src/ops/agg.rs Outdated Show resolved Hide resolved
@colin-ho colin-ho merged commit ac9ffcf into main Feb 5, 2024
40 checks passed
@colin-ho colin-ho deleted the colin/map_groups branch February 5, 2024 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add APIs like df.groupby.apply(udf) or df.apply(udf) that give greater freedom
2 participants