[PERF] scan task in memory estimate #1901
samster25
commented
Feb 20, 2024
- Closes: Allow daft.read_parquet take in custom statistics from the user #1898
- When column stats are provided, use only the columns in the materialized schema to estimate the in-memory size.
  - When column stats are missing for a field, fall back on the schema estimate for that field.
- When num_rows is provided, use the materialized schema to estimate the in-memory size.
- When neither is provided, estimate the in-memory size using an inflation factor (the same one we use for writes) and approximate the number of rows, then use the materialized schema to estimate the in-memory size.
- Thread the new in-memory estimator through to the ScanWithTask physical op.
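
For reviewers, here is a rough Python sketch of the fallback order described above. It is one possible reading of the description, not the code in this PR: `Field`, `FALLBACK_BYTES`, `INFLATION_FACTOR`, `estimate_in_memory_bytes`, and the shape of `column_stats` (column name mapped to average bytes per value) are all hypothetical, and the numeric constants are placeholders rather than Daft's real values.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical per-value size estimates by dtype (placeholder numbers).
FALLBACK_BYTES = {"int32": 4, "int64": 8, "float64": 8, "bool": 1, "string": 20}

# Hypothetical on-disk to in-memory inflation factor (placeholder number).
INFLATION_FACTOR = 3.0


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str


def schema_bytes_per_row(fields: List[Field]) -> float:
    """Schema-only estimate of the width of one row, in bytes."""
    return float(sum(FALLBACK_BYTES.get(f.dtype, 8) for f in fields))


def estimate_in_memory_bytes(
    full_schema: List[Field],
    materialized_schema: List[Field],
    size_on_disk: Optional[int] = None,
    num_rows: Optional[int] = None,
    column_stats: Optional[Dict[str, float]] = None,
) -> Optional[float]:
    """Estimate the in-memory size of the materialized columns only."""
    if num_rows is None:
        # Neither stats nor a row count: inflate the on-disk size and
        # divide by the full-schema row width to approximate the row count.
        if size_on_disk is None:
            return None
        row_width = schema_bytes_per_row(full_schema)
        if row_width == 0:
            return None
        num_rows = (size_on_disk * INFLATION_FACTOR) / row_width

    # Only materialized columns count. Use column stats where present and
    # fall back on the schema estimate for fields without stats.
    stats = column_stats or {}
    bytes_per_row = sum(
        stats.get(f.name, FALLBACK_BYTES.get(f.dtype, 8))
        for f in materialized_schema
    )
    return num_rows * bytes_per_row


full = [Field("id", "int64"), Field("name", "string"), Field("score", "float64")]
projected = [Field("id", "int64"), Field("score", "float64")]

with_rows = estimate_in_memory_bytes(full, projected, num_rows=100)
with_stats = estimate_in_memory_bytes(
    full, projected, num_rows=100, column_stats={"id": 4.0}
)
from_disk = estimate_in_memory_bytes(full, projected, size_on_disk=1200)
```

In this sketch a projection that drops the `name` column lowers the estimate, which is the point of using the materialized schema rather than the full file schema.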
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1901      +/-   ##
==========================================
+ Coverage   83.93%   85.54%   +1.61%
==========================================
  Files          55       55
  Lines        6112     6228     +116
==========================================
+ Hits         5130     5328     +198
+ Misses        982      900      -82
Seems good, should we write some unit tests for these?