-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PERF] Set min_partitions for post aggregation shuffles #1861
[PERF] Set min_partitions for post aggregation shuffles #1861
Conversation
src/common/daft-config/src/lib.rs
Outdated
@@ -51,6 +52,7 @@ impl Default for DaftExecutionConfig { | |||
parquet_inflation_factor: 3.0, | |||
csv_target_filesize: 512 * 1024 * 1024, // 512MB | |||
csv_inflation_factor: 0.5, | |||
aggregation_min_partitions: 10, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this too small?
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1861 +/- ##
==========================================
+ Coverage 85.50% 85.55% +0.04%
==========================================
Files 55 55
Lines 6194 6194
==========================================
+ Hits 5296 5299 +3
+ Misses 898 895 -3
|
daft/context.py
Outdated
@@ -215,6 +215,7 @@ def set_execution_config( | |||
parquet_inflation_factor: float | None = None, | |||
csv_target_filesize: int | None = None, | |||
csv_inflation_factor: float | None = None, | |||
aggregation_min_partitions: int | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should rename this to shuffle_aggregation_default_partitions
and the default value should be 200 to mimic spark. Also this should be in the planning config!
ec83263
to
151931c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Added a config variable to set the minimum partitions produced after first stage aggregations.
Here's a comparison between shuffling with same number of input partitions vs a much smaller number