[FEAT] Estimate materialized size of ScanTask better from Parquet reads #3302

Closed
jaychia wants to merge 10 commits into main from jay/better-scan-task-estimations-2

Conversation

@jaychia (Contributor) commented Nov 15, 2024

Adds better estimation of the materialized bytes in memory for a given Parquet ScanTask.

We do this by reusing the same Parquet metadata that we already fetch for schema inference. From each column chunk's metadata we read fields such as the reported uncompressed_size and compressed_size, and we use those statistics, together with the file's size on disk, to estimate the materialized size of the data when reading the Parquet file.
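As a rough illustration of the idea (not Daft's actual implementation; the struct and function names below are hypothetical stand-ins for the per-column-chunk footer fields, of which parquet2 exposes equivalents), the estimate amounts to scaling the on-disk size by the aggregate uncompressed-to-compressed ratio:

```rust
// Hypothetical mirror of the per-column-chunk metadata fields in a Parquet footer.
struct ColumnChunkMeta {
    compressed_size: u64,   // bytes on disk, after compression
    uncompressed_size: u64, // bytes after decompression (still encoded)
}

/// Scale the file's on-disk size by the aggregate inflation ratio reported
/// in the metadata to approximate the bytes needed once decompressed.
fn estimate_materialized_size(file_size_on_disk: u64, chunks: &[ColumnChunkMeta]) -> u64 {
    let compressed: u64 = chunks.iter().map(|c| c.compressed_size).sum();
    let uncompressed: u64 = chunks.iter().map(|c| c.uncompressed_size).sum();
    if compressed == 0 {
        return file_size_on_disk; // no metadata signal; fall back to size on disk
    }
    (file_size_on_disk as f64 * (uncompressed as f64 / compressed as f64)) as u64
}

fn main() {
    let chunks = [
        ColumnChunkMeta { compressed_size: 1_000, uncompressed_size: 4_000 },
        ColumnChunkMeta { compressed_size: 2_000, uncompressed_size: 5_000 },
    ];
    // 3 KB on disk with a 3x aggregate inflation ratio -> ~9 KB in memory.
    println!("{}", estimate_materialized_size(3_000, &chunks));
}
```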

TODOs:

  • Account for dictionary encoding (and potentially other forms of encoding) -- when we read Parquet we go from compressed -> uncompressed -> decoded, and I think we still need to account for encoding when estimating how much memory this data will take up once decoded into Daft Series.

github-actions bot added the enhancement (New feature or request) label on Nov 15, 2024

codspeed-hq bot commented Nov 15, 2024

CodSpeed Performance Report

Merging #3302 will improve performance by 42.24%

Comparing jay/better-scan-task-estimations-2 (b6a7b7f) with main (60ae62f)

Summary

⚡ 1 improvement
✅ 16 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | jay/better-scan-task-estimations-2 | Change |
| --- | --- | --- | --- |
| test_iter_rows_first_row[100 Small Files] | 388.4 ms | 273.1 ms | +42.24% |

@@ -269,6 +270,9 @@ pub(crate) fn split_by_row_groups(

*chunk_spec = Some(ChunkSpec::Parquet(curr_row_group_indices));
*size_bytes = Some(curr_size_bytes as u64);

// Re-estimate the materialized size in memory: scale the whole-file
// estimate by the fraction of the file's rows kept in this split.
new_estimated_size_bytes_in_memory = t.estimated_materialized_size_bytes
    .map(|est| (est as f64 * (curr_num_rows as f64 / file.num_rows as f64)) as usize);
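
For reference, the rescaling above amounts to a proportional scale by row count. Pulled out as a standalone sketch (the function name and signature here are hypothetical, not Daft's actual API):

```rust
/// Hypothetical helper: when a ScanTask covering a whole file is split by
/// row groups, scale the whole-file materialized-size estimate by the
/// fraction of the file's rows that land in the split.
fn rescale_estimate(
    whole_file_estimate: Option<usize>,
    rows_in_split: usize,
    rows_in_file: usize,
) -> Option<usize> {
    whole_file_estimate
        .map(|est| (est as f64 * (rows_in_split as f64 / rows_in_file as f64)) as usize)
}
```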
@jaychia (Contributor, Author) commented:

@kevinzwang could you take a look at this logic for splitting ScanTasks, and trying to correctly predict the resultant estimated materialized size bytes?

Looks like we're doing some crazy stuff wrt modifying the FileMetadata and I couldn't really figure out if it is safe to do this.

@kevinzwang (Member) replied:

This looks reasonable to me.

@kevinzwang (Member) commented:

@jaychia is this ready for review? Looks like a lot of tests are still failing

@jaychia (Contributor, Author) commented Nov 19, 2024

> @jaychia is this ready for review? Looks like a lot of tests are still failing

Sorry, thanks for calling me out -- I have to do some more refactors to this PR. Taking this back into draft mode and un-requesting reviews.

@jaychia jaychia marked this pull request as draft November 19, 2024 21:16
@jaychia jaychia force-pushed the jay/better-scan-task-estimations-2 branch from 6c7bd68 to cafe6b3 Compare November 19, 2024 21:18
@jaychia jaychia force-pushed the jay/better-scan-task-estimations-2 branch from 2fcee2f to e516c54 Compare November 20, 2024 22:10

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 96.47887% with 15 lines in your changes missing coverage. Please review.

Project coverage is 77.44%. Comparing base (b6695eb) to head (b6a7b7f).
Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/daft-scan/src/glob.rs | 91.02% | 7 Missing ⚠️ |
| src/daft-scan/src/size_estimations.rs | 98.12% | 6 Missing ⚠️ |
| src/daft-scan/src/anonymous.rs | 0.00% | 1 Missing ⚠️ |
| src/daft-scan/src/python.rs | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3302      +/-   ##
==========================================
+ Coverage   77.39%   77.44%   +0.05%     
==========================================
  Files         678      686       +8     
  Lines       83300    84006     +706     
==========================================
+ Hits        64469    65060     +591     
- Misses      18831    18946     +115     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/daft-micropartition/src/micropartition.rs | 90.85% <100.00%> (+0.03%) ⬆️ |
| src/daft-micropartition/src/ops/cast_to_schema.rs | 100.00% <100.00%> (ø) |
| src/daft-scan/src/lib.rs | 61.14% <100.00%> (+0.87%) ⬆️ |
| src/daft-scan/src/scan_task_iters.rs | 97.01% <100.00%> (+0.06%) ⬆️ |
| src/daft-scan/src/anonymous.rs | 0.00% <0.00%> (ø) |
| src/daft-scan/src/python.rs | 76.64% <66.66%> (-0.09%) ⬇️ |
| src/daft-scan/src/size_estimations.rs | 98.12% <98.12%> (ø) |
| src/daft-scan/src/glob.rs | 90.78% <91.02%> (+0.50%) ⬆️ |

... and 25 files with indirect coverage changes


@jaychia jaychia marked this pull request as ready for review November 22, 2024 21:47

@jaychia (Contributor, Author) commented Nov 23, 2024

Actually, I'm unhappy with this approach and think we need something more sophisticated. Closing this PR; I'm going to start a new one.

The problem with the approach in this PR is that it only uses the FileMetadata, which unfortunately doesn't give us a good way of figuring out the size of the data after both decompression and decoding. More concretely, we need access to some of the data pages (the dictionary page being the most important one) in order to make good estimates.
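
As a hypothetical illustration of the gap (the numbers below are made up, and real Parquet bit-packs its dictionary indices rather than using fixed 4-byte ones, so the true gap is often even larger):

```rust
fn main() {
    // A dictionary-encoded string column: 1M rows drawn from 100 distinct
    // 32-byte values. The footer-level uncompressed_size roughly reflects
    // the dictionary values plus the per-row indices...
    let num_rows: u64 = 1_000_000;
    let distinct: u64 = 100;
    let avg_len: u64 = 32;
    let encoded = distinct * avg_len + num_rows * 4; // ~4 MB (4-byte indices, for simplicity)

    // ...but decoding into a flat in-memory string column repeats the full
    // value for every row.
    let decoded = num_rows * avg_len; // ~32 MB

    println!("encoded ~{encoded} B, decoded ~{decoded} B ({}x)", decoded / encoded);
}
```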

@jaychia jaychia closed this Nov 23, 2024
Labels: enhancement (New feature or request)
2 participants