[FEAT] Estimate materialized size of ScanTask better from Parquet reads #3302
Conversation
CodSpeed Performance Report: Merging #3302 will improve performance by 42.24%.
Summary
Benchmarks breakdown
src/daft-scan/src/scan_task_iters.rs (Outdated)
@@ -269,6 +270,9 @@ pub(crate) fn split_by_row_groups(
    *chunk_spec = Some(ChunkSpec::Parquet(curr_row_group_indices));
    *size_bytes = Some(curr_size_bytes as u64);

    // Re-estimate the size bytes in memory
    new_estimated_size_bytes_in_memory = t.estimated_materialized_size_bytes.map(|est| (est as f64 * (curr_num_rows as f64 / file.num_rows as f64)) as usize);
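For context, a minimal free-standing sketch of the proportional scaling applied in the hunk above. The helper function and its signature are hypothetical (only the `.map(...)` expression mirrors the diff): when a split ScanTask covers only some of the file's rows, its in-memory estimate is scaled by the fraction of rows it now covers.

```rust
// Hypothetical sketch, not code from this PR: scale the original estimate by
// the fraction of the file's rows that the split ScanTask now covers.
fn scale_estimate(
    estimated_materialized_size_bytes: Option<usize>,
    curr_num_rows: usize,
    file_num_rows: usize,
) -> Option<usize> {
    estimated_materialized_size_bytes
        .map(|est| (est as f64 * (curr_num_rows as f64 / file_num_rows as f64)) as usize)
}

// e.g. a 1_000_000-row file estimated at 800 MB, split into a task covering
// 250_000 rows, would be re-estimated at roughly 200 MB.
```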
@kevinzwang could you take a look at this logic for splitting ScanTasks, and trying to correctly predict the resultant estimated materialized size bytes?
Looks like we're doing some crazy stuff wrt modifying the FileMetadata and I couldn't really figure out if it is safe to do this.
This looks reasonable to me
@jaychia is this ready for review? Looks like a lot of tests are still failing.
Sorry, thanks for calling me out -- I have to do some more refactors to this PR. Taking this back into draft mode and un-requesting reviews.
Force-pushed from 6c7bd68 to cafe6b3
Force-pushed from 2fcee2f to e516c54
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3302 +/- ##
==========================================
+ Coverage 77.39% 77.44% +0.05%
==========================================
Files 678 686 +8
Lines 83300 84006 +706
==========================================
+ Hits 64469 65060 +591
- Misses 18831 18946 +115
Actually, I'm unhappy with this approach and think we need something more sophisticated. Closing this PR and going to start a new one. The problem with the approach in this PR is that it only uses the FileMetadata, which unfortunately doesn't give us a good way of figuring out the size of the data after both decompression and decoding. More concretely, we need access to some of the data pages (the dictionary page being the most important one) in order to make good decisions.
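As a purely illustrative worked example of the decoding problem (hypothetical numbers, not code from this PR): a heavily dictionary-encoded column can look small in the file's reported byte counts yet expand dramatically once decoded into a flat in-memory column, which is why metadata-level sizes alone can underestimate badly.

```rust
// Illustrative only: dictionary-encoded string column with few distinct values.
fn main() {
    let num_rows: u64 = 1_000_000;
    let distinct_values: u64 = 10;
    let avg_value_len: u64 = 100; // bytes per distinct string in the dictionary page

    // Roughly what the file stores (ignoring compression):
    // the dictionary page plus ~2-byte indices per row.
    let encoded = distinct_values * avg_value_len + num_rows * 2;
    // Decoded into a plain in-memory representation: every row holds a full value.
    let decoded = num_rows * avg_value_len;

    println!("~{} KB encoded vs ~{} MB decoded", encoded / 1024, decoded / (1024 * 1024));
}
```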
Adds better estimation of the materialized bytes in memory for a given Parquet ScanTask.
We do this by reusing the same Parquet metadata that we already fetch for schema inference. We read fields such as the reported uncompressed_size and compressed_size of each column chunk, and use those statistics to estimate the materialized size of the data from the file's size on disk.
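A minimal sketch of the idea described above (assumed field and function names, not the PR's actual code): derive a per-file inflation ratio from the column chunk metadata and apply it to the on-disk size of the ScanTask.

```rust
// Hypothetical sketch: estimate in-memory size from Parquet column chunk stats.
struct ColumnChunkStats {
    compressed_size: u64,
    uncompressed_size: u64,
}

fn estimate_materialized_size(columns: &[ColumnChunkStats], size_on_disk: u64) -> u64 {
    let compressed: u64 = columns.iter().map(|c| c.compressed_size).sum();
    let uncompressed: u64 = columns.iter().map(|c| c.uncompressed_size).sum();
    if compressed == 0 {
        return size_on_disk;
    }
    // Inflation ratio covers decompression only; decoding into Daft Series is
    // not accounted for here (see the TODOs below).
    let ratio = uncompressed as f64 / compressed as f64;
    (size_on_disk as f64 * ratio) as u64
}
```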
TODOs:
- Data goes compressed -> uncompressed -> decoded, and I think we still need to account for encoding here when thinking about how much memory this data will take up when decoded into Daft Series.