-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] [New Query Planner] [2/N] Push partition spec into physical plan, remove Coalesce logical op. #1540
Conversation
The "major changes" sound good to me
Sounds like then split/coalesce is a physical-only concept, and the logical plan only knows the concept of "repartition"?
I do think we will want to add this soon and do "trigger plan optimization + translation to get this answer". Before we have AQE/dynamic repartitioning at runtime, the number of partitions is pretty important to know for performance reasons (e.g. if I have 20,000 partitions, as a user I should be aware and run a repartition). I'm ok throwing a |
Yep, that's correct!
I'll take a look at the minimal work to support this. In terms of redundant work, I think this shouldn't amount to much more than a repeated optimization + plan translation, which is currently very cheap. Once we move globbing to be at planning time, we could find a way to cache that globbing, if needed. |
Actually, we already have optimization + plan translation cached with the Rust-side physical plan scheduler, so we should just be able to expose another |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1540 +/- ##
=======================================
Coverage 85.20% 85.21%
=======================================
Files 54 54
Lines 5131 5120 -11
=======================================
- Hits 4372 4363 -9
+ Misses 759 757 -2
|
Co-authored-by: Jay Chia <[email protected]>
This PR pushes the partition spec concept into the physical plan; this is required in order to defer partition information gathering (e.g. globbing) until plan translation, and also helps unify some past duplication of partitioning-based logic between the logical plan and logical -> physical translation.
In addition to push all partition spec logic to plan translation and the physical plan, the following major changes were made to facilitate this:
Coalesce
logical op is absorbed into theRepartition
logical op; the switching between aSplit
orCoalesce
is now made at logical -> physical translation time.df.repartition()
now takes an optionalnum_partitions
arg, andnum_partitions
is now optional for the logicalRepartition
op; this is needed to support the "hash repartition by these columns to the same number of partitions as the input" use case, e.g. for writing out partitioned CSV/JSON datasets.df.num_partitions()
has been preserved, by triggering optimization + plan translation to fulfill the query; we can look at caching this in the future, so we don't have to redo globbing once we've moved globbing to plan translation time.df.num_partitions()
API is removed, since this can't be determined based on just the logical plan alone. We could add this back and trigger plan optimization + translation to get this answer, although this will end up triggering nontrivial computation for the globbing backend, so we may want to (1) only trigger enough execution in order to get thenum_partitions
answer, and (2) cache that execution.TODOs Before Merging
DropRepartitions
optimizations as a split between a logical optimization rule (for non-partition spec logic) and a plan translation-time optimization (for partition spec logic).Project
unit tests around partition spec munging to physicalProject
operator