[FEAT] Estimate materialized size of ScanTask better from Parquet reads #3302

Closed
jaychia wants to merge 10 commits into main from jay/better-scan-task-estimations-2

Conversation

@jaychia (Contributor) commented Nov 15, 2024

Adds better estimation of the materialized bytes in memory for a given Parquet ScanTask.

We do this by reusing the same Parquet metadata that we already fetch for schema inference. From each column chunk's metadata we read fields such as the reported uncompressed_size and compressed_size, and we use those statistics, together with the file's size on disk, to estimate the materialized size of the data when reading the Parquet file.
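As a rough illustration of the idea (not Daft's actual implementation; the struct and function names below are hypothetical stand-ins for the per-column-chunk footer fields, of which parquet2 exposes equivalents), the estimate amounts to scaling the on-disk size by the aggregate uncompressed-to-compressed ratio:

```rust
// Hypothetical mirror of the per-column-chunk metadata fields in a Parquet footer.
struct ColumnChunkMeta {
    compressed_size: u64,   // bytes on disk, after compression
    uncompressed_size: u64, // bytes after decompression (still encoded)
}

/// Scale the file's on-disk size by the aggregate inflation ratio reported
/// in the metadata to approximate the bytes needed once decompressed.
fn estimate_materialized_size(file_size_on_disk: u64, chunks: &[ColumnChunkMeta]) -> u64 {
    let compressed: u64 = chunks.iter().map(|c| c.compressed_size).sum();
    let uncompressed: u64 = chunks.iter().map(|c| c.uncompressed_size).sum();
    if compressed == 0 {
        return file_size_on_disk; // no metadata signal; fall back to size on disk
    }
    (file_size_on_disk as f64 * (uncompressed as f64 / compressed as f64)) as u64
}

fn main() {
    let chunks = [
        ColumnChunkMeta { compressed_size: 1_000, uncompressed_size: 4_000 },
        ColumnChunkMeta { compressed_size: 2_000, uncompressed_size: 5_000 },
    ];
    // 3 KB on disk with a 3x aggregate inflation ratio -> ~9 KB in memory.
    println!("{}", estimate_materialized_size(3_000, &chunks));
}
```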

TODOs:

  • Account for dictionary encoding (and potentially other forms of encoding) -- when we read Parquet we go from compressed -> uncompressed -> decoded, and I think we still need to account for encoding when estimating how much memory this data will take up once decoded into Daft Series.

github-actions bot added the enhancement (New feature or request) label on Nov 15, 2024

codspeed-hq bot commented Nov 15, 2024

CodSpeed Performance Report

Merging #3302 will improve performance by 42.24%

Comparing jay/better-scan-task-estimations-2 (b6a7b7f) with main (60ae62f)

Summary

⚡ 1 improvement
✅ 16 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | jay/better-scan-task-estimations-2 | Change |
| --- | --- | --- | --- |
| test_iter_rows_first_row[100 Small Files] | 388.4 ms | 273.1 ms | +42.24% |

@@ -269,6 +270,9 @@ pub(crate) fn split_by_row_groups(

*chunk_spec = Some(ChunkSpec::Parquet(curr_row_group_indices));
*size_bytes = Some(curr_size_bytes as u64);

// Re-estimate the materialized size in memory: scale the whole-file
// estimate by the fraction of the file's rows kept in this split.
new_estimated_size_bytes_in_memory = t.estimated_materialized_size_bytes
    .map(|est| (est as f64 * (curr_num_rows as f64 / file.num_rows as f64)) as usize);
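
For reference, the rescaling above amounts to a proportional scale by row count. Pulled out as a standalone sketch (the function name and signature here are hypothetical, not Daft's actual API):

```rust
/// Hypothetical helper: when a ScanTask covering a whole file is split by
/// row groups, scale the whole-file materialized-size estimate by the
/// fraction of the file's rows that land in the split.
fn rescale_estimate(
    whole_file_estimate: Option<usize>,
    rows_in_split: usize,
    rows_in_file: usize,
) -> Option<usize> {
    whole_file_estimate
        .map(|est| (est as f64 * (rows_in_split as f64 / rows_in_file as f64)) as usize)
}
```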
@jaychia (Contributor, Author) commented:

@kevinzwang could you take a look at this logic for splitting ScanTasks, and trying to correctly predict the resultant estimated materialized size bytes?

Looks like we're doing some crazy stuff wrt modifying the FileMetadata and I couldn't really figure out if it is safe to do this.

@kevinzwang (Member) replied:

This looks reasonable to me.

@kevinzwang (Member) commented:

@jaychia is this ready for review? Looks like a lot of tests are still failing

@jaychia (Contributor, Author) commented Nov 19, 2024

> @jaychia is this ready for review? Looks like a lot of tests are still failing

Sorry, thanks for calling me out -- I have to do some more refactors to this PR. Taking this back into draft mode and un-requesting reviews.

@jaychia jaychia marked this pull request as draft November 19, 2024 21:16
@jaychia jaychia force-pushed the jay/better-scan-task-estimations-2 branch from 6c7bd68 to cafe6b3 Compare November 19, 2024 21:18
@jaychia jaychia force-pushed the jay/better-scan-task-estimations-2 branch from 2fcee2f to e516c54 Compare November 20, 2024 22:10

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 96.47887% with 15 lines in your changes missing coverage. Please review.

Project coverage is 77.44%. Comparing base (b6695eb) to head (b6a7b7f).
Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/daft-scan/src/glob.rs | 91.02% | 7 Missing ⚠️ |
| src/daft-scan/src/size_estimations.rs | 98.12% | 6 Missing ⚠️ |
| src/daft-scan/src/anonymous.rs | 0.00% | 1 Missing ⚠️ |
| src/daft-scan/src/python.rs | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3302      +/-   ##
==========================================
+ Coverage   77.39%   77.44%   +0.05%     
==========================================
  Files         678      686       +8     
  Lines       83300    84006     +706     
==========================================
+ Hits        64469    65060     +591     
- Misses      18831    18946     +115     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/daft-micropartition/src/micropartition.rs | 90.85% <100.00%> (+0.03%) ⬆️ |
| src/daft-micropartition/src/ops/cast_to_schema.rs | 100.00% <100.00%> (ø) |
| src/daft-scan/src/lib.rs | 61.14% <100.00%> (+0.87%) ⬆️ |
| src/daft-scan/src/scan_task_iters.rs | 97.01% <100.00%> (+0.06%) ⬆️ |
| src/daft-scan/src/anonymous.rs | 0.00% <0.00%> (ø) |
| src/daft-scan/src/python.rs | 76.64% <66.66%> (-0.09%) ⬇️ |
| src/daft-scan/src/size_estimations.rs | 98.12% <98.12%> (ø) |
| src/daft-scan/src/glob.rs | 90.78% <91.02%> (+0.50%) ⬆️ |

... and 25 files with indirect coverage changes


@jaychia jaychia marked this pull request as ready for review November 22, 2024 21:47

@jaychia (Contributor, Author) commented Nov 23, 2024

Actually, I'm unhappy with this approach and think we need something more sophisticated. Closing this PR; I'm going to start a new one.

The problem with the approach in this PR is that it only uses the FileMetadata, which unfortunately doesn't give us a good way of figuring out the size of the data after both decompression and decoding. More concretely, we need access to some of the data pages (the dictionary page being the most important one) in order to make good estimates.
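
As a hypothetical illustration of the gap (the numbers below are made up, and real Parquet bit-packs its dictionary indices rather than using fixed 4-byte ones, so the true gap is often even larger):

```rust
fn main() {
    // A dictionary-encoded string column: 1M rows drawn from 100 distinct
    // 32-byte values. The footer-level uncompressed_size roughly reflects
    // the dictionary values plus the per-row indices...
    let num_rows: u64 = 1_000_000;
    let distinct: u64 = 100;
    let avg_len: u64 = 32;
    let encoded = distinct * avg_len + num_rows * 4; // ~4 MB (4-byte indices, for simplicity)

    // ...but decoding into a flat in-memory string column repeats the full
    // value for every row.
    let decoded = num_rows * avg_len; // ~32 MB

    println!("encoded ~{encoded} B, decoded ~{decoded} B ({}x)", decoded / encoded);
}
```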

@jaychia jaychia closed this Nov 23, 2024
Labels: enhancement (New feature or request)
2 participants