[FEAT] [Join Optimizations] Add sort-merge join. (#1755)
This PR adds a sort-merge join implementation as a new join strategy, where each side of the join is sorted on the join keys, and then the sorted tables are merged. The sort-merge join strategy is chosen automatically by the query planner if it is expected to be faster than the hash join and the broadcast join.

Similar to Spark's ability to specify a join algorithm hint, this PR also exposes a new optional `strategy` arg for `df.join()`, which allows users (and our Python-level tests) to manually specify a join algorithm; currently `"hash"`, `"sort_merge"`, and `"broadcast"` are supported, with the default `None` resulting in the query planner choosing a join algorithm automatically.

```python
df = left.join(right, on="foo", strategy="sort_merge")
```

Closes #1776

## Query Planning

The query planner chooses the sort-merge join as its join strategy if the larger side of the join is range-partitioned, or if the smaller side of the join is range-partitioned and the larger side is not partitioned.

In the future, we will want to do a sort-merge join:

1. If a downstream operation requires the table to be sorted on the join key.
2. If neither side of the join is partitioned AND we determine that the sort-merge join is faster on unpartitioned data than the hash join (pending benchmarking).

**NOTE:** We currently only support sort-merge joins on primitive join keys, see TODO below.

## Query Scheduling

### Full sort-merge join

All partitions for both sides of the join are materialized, upon which we calculate sort boundaries on samples from both sides of the join. These combined sort boundaries are used to sort each side of the join. Once each side is sorted with the same sort boundaries and the same number of partitions, we perform a merge join, which merges overlapping pairs of partitions.

### Sort-eliding merge-join for presorted (but unaligned) dataframes

A merge join is performed on the two sorted sides of the join, which merges overlapping pairs of partitions. The partition boundaries of the two sides of the join do not need to line up; partition overlap will be determined in the scheduler and merge-join tasks will be emitted for overlapping partitions.

## TODOs

- [x] Test coverage
- [x] Fix dispatch logic for adjacent partitions with intersecting bounds (broadcast on matching bounds instead of naive zip).
- [x] Add support for sort-merge joins on sorted dataframes with unaligned partition boundaries.
- [ ] Add support for non-primitive join keys.
- [ ] Benchmarking to validate + tweak the heuristics used by the query planner to choose whether to use the sort-merge join.
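
As an illustration of the rule described under Query Planning, here is a minimal sketch of the partitioning-based check. It is not the planner code from this PR: `Scheme`, `Side`, and `prefers_sort_merge` are hypothetical names, "not partitioned" is assumed to mean a distinct unpartitioned scheme, and the cost comparison against the hash and broadcast joins is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Scheme(Enum):
    # Hypothetical partitioning schemes, named for illustration only.
    UNPARTITIONED = "unpartitioned"
    HASH = "hash"
    RANGE = "range"


@dataclass
class Side:
    # Hypothetical summary of one join input.
    size_bytes: int
    scheme: Scheme


def prefers_sort_merge(left: Side, right: Side) -> bool:
    """Return True if the partitioning rule from the PR description applies."""
    if left.size_bytes >= right.size_bytes:
        larger, smaller = left, right
    else:
        larger, smaller = right, left
    # Case 1: the larger side is already range-partitioned.
    if larger.scheme is Scheme.RANGE:
        return True
    # Case 2: the smaller side is range-partitioned and the larger side
    # has no partitioning (assumed here to mean UNPARTITIONED).
    return smaller.scheme is Scheme.RANGE and larger.scheme is Scheme.UNPARTITIONED
```

Under these assumptions, a hash-partitioned larger side paired with a range-partitioned smaller side does not select the sort-merge join, since only the unpartitioned case is named in the description.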
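
The overlap-based dispatch described under Query Scheduling can be sketched as follows. This is a simplified illustration, not the scheduler code from this PR: `overlapping_pairs` is a hypothetical helper, and it assumes each partition is summarized by inclusive integer `(min_key, max_key)` bounds that are already in sorted order on both sides.

```python
from typing import List, Tuple

# Assumed representation: inclusive (min_key, max_key) bounds per partition.
Bounds = Tuple[int, int]


def overlapping_pairs(left: List[Bounds], right: List[Bounds]) -> List[Tuple[int, int]]:
    """Return (left_index, right_index) pairs whose key ranges intersect."""
    pairs = []
    start = 0  # first right partition that can still overlap a left partition
    for i, (l_lo, l_hi) in enumerate(left):
        # Right partitions that end before this left partition starts cannot
        # overlap it or any later left partition, so skip them permanently.
        while start < len(right) and right[start][1] < l_lo:
            start += 1
        # Emit a merge-join task for every right partition that begins
        # at or before the end of this left partition.
        j = start
        while j < len(right) and right[j][0] <= l_hi:
            pairs.append((i, j))
            j += 1
    return pairs


# Unaligned boundaries: each left partition may pair with several right ones.
print(overlapping_pairs([(0, 10), (11, 20)], [(0, 3), (4, 12), (15, 30)]))
```

Because the bounds are treated as inclusive, adjacent partitions that share a boundary key pair with every partition on the other side containing that key, rather than being zipped one-to-one.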
1 parent 6cda37a, commit 2eefca6

Showing 34 changed files with 2,119 additions and 154 deletions.