Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] Set min_partitions for post aggregation shuffles #1861

Merged
merged 5 commits into from
Feb 13, 2024

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Feb 9, 2024

Added a config variable to set the minimum partitions produced after first stage aggregations.

Here's a comparison between shuffling with same number of input partitions vs a much smaller number

Screenshot 2024-02-09 at 4 40 08 PM Screenshot 2024-02-09 at 4 41 14 PM

@@ -51,6 +52,7 @@ impl Default for DaftExecutionConfig {
parquet_inflation_factor: 3.0,
csv_target_filesize: 512 * 1024 * 1024, // 512MB
csv_inflation_factor: 0.5,
aggregation_min_partitions: 10,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this too small?

Copy link

codecov bot commented Feb 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (adac24c) 85.50% compared to head (2cb12d2) 85.55%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1861      +/-   ##
==========================================
+ Coverage   85.50%   85.55%   +0.04%     
==========================================
  Files          55       55              
  Lines        6194     6194              
==========================================
+ Hits         5296     5299       +3     
+ Misses        898      895       -3     
Files Coverage Δ
daft/context.py 76.22% <ø> (ø)

... and 2 files with indirect coverage changes

daft/context.py Outdated
@@ -215,6 +215,7 @@ def set_execution_config(
parquet_inflation_factor: float | None = None,
csv_target_filesize: int | None = None,
csv_inflation_factor: float | None = None,
aggregation_min_partitions: int | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rename this to shuffle_aggregation_default_partitions and the default value should be 200 to mimic spark. Also this should be in the planning config!

@colin-ho colin-ho requested a review from samster25 February 13, 2024 01:16
@colin-ho colin-ho force-pushed the colin/reduce-shuffle-partitions-after-1st-agg branch from ec83263 to 151931c Compare February 13, 2024 02:20
Copy link
Member

@samster25 samster25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@samster25 samster25 merged commit c416718 into main Feb 13, 2024
42 checks passed
@samster25 samster25 deleted the colin/reduce-shuffle-partitions-after-1st-agg branch February 13, 2024 03:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants