[PERF] scan task in memory estimate #1901

samster25 · 2024-02-20T04:11:33Z

Closes: Allow daft.read_parquet take in custom statistics from the user #1898

1. When column stats are provided, use only the columns in the materialized schema to estimate the in-memory size.
   * When column stats are missing for a field, fall back on the schema estimate for that field.
2. When num_rows is provided, use the materialized schema to estimate the in-memory size.
3. When neither is provided, estimate the in-memory size using an inflation factor (the same as for our writes) and approximate the number of rows, then use the materialized schema to estimate the in-memory size.
4. Thread the new in-memory estimator through to the ScanWithTask physical op.
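
The fallback order described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Rust code in this PR: `Field`, `ASSUMED_WIDTH`, `row_width`, `estimate_in_memory_size`, the per-dtype byte widths, and the default inflation factor of 2.0 are all hypothetical names and values chosen for the example. It also assumes column stats are given as a mapping from column name to estimated bytes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed per-value byte widths, for illustration only (not Daft's estimates).
ASSUMED_WIDTH = {"int64": 8, "float64": 8, "bool": 1, "string": 20}


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str


def row_width(fields: List[Field]) -> int:
    """Schema-based estimate of bytes per row for the given fields."""
    return sum(ASSUMED_WIDTH[f.dtype] for f in fields)


def estimate_in_memory_size(
    materialized: List[Field],
    file_schema: List[Field],
    column_stats: Optional[Dict[str, int]] = None,
    num_rows: Optional[int] = None,
    size_on_disk: Optional[int] = None,
    inflation_factor: float = 2.0,  # assumed value, not taken from the PR
) -> Optional[int]:
    """Return an estimated in-memory size in bytes, or None if nothing is known."""

    def rows() -> Optional[float]:
        # Use the exact row count when given; otherwise approximate it by
        # inflating the on-disk size and dividing by the full-schema row width.
        if num_rows is not None:
            return float(num_rows)
        if size_on_disk is None:
            return None
        inflated = size_on_disk * inflation_factor
        return inflated / max(row_width(file_schema), 1)

    if column_stats is not None:
        # Case 1: count only the materialized columns; a column without
        # stats falls back on the schema estimate for that field.
        total = 0.0
        for field in materialized:
            stat = column_stats.get(field.name)
            if stat is None:
                n = rows()
                if n is None:
                    return None
                stat = n * ASSUMED_WIDTH[field.dtype]
            total += stat
        return int(total)

    # Cases 2 and 3: exact or approximated row count times the
    # materialized-schema row width.
    n = rows()
    if n is None:
        return None
    return int(n * row_width(materialized))
```

In this sketch, a projection that drops a wide column lowers the estimate in every branch, because only the materialized fields contribute to the final size.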

codecov · 2024-02-20T04:23:41Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.54%. Comparing base (1a94752) to head (0a1cdc8).

❗ Current head 0a1cdc8 differs from pull request most recent head 6bafa32. Consider uploading reports for the commit 6bafa32 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1901      +/-   ##
==========================================
+ Coverage   83.93%   85.54%   +1.61%     
==========================================
  Files          55       55              
  Lines        6112     6228     +116     
==========================================
+ Hits         5130     5328     +198     
+ Misses        982      900      -82

Files	Coverage Δ
daft/execution/rust_physical_plan_shim.py	`93.93% <100.00%> (-1.52%)`	⬇️

... and 15 files with indirect coverage changes

jaychia

Seems good, should we write some unit tests for these?

src/daft-core/src/datatypes/dtype.rs

src/daft-micropartition/src/micropartition.rs

src/daft-scan/src/lib.rs


samster25 added 4 commits February 19, 2024 18:34

wip

7e3d79c

in memory estimation for scan tasks

e12e46d

clippy fixes

04ac6d2

update disk size fallback with projection

f57a9e6

github-actions bot added the performance label Feb 20, 2024

samster25 requested review from jaychia and clarkzinzow February 20, 2024 04:17

samster25 mentioned this pull request Feb 20, 2024

Allow daft.read_parquet take in custom statistics from the user #1898

Closed

style

91a4b2f

jaychia approved these changes Feb 20, 2024


samster25 added 3 commits February 22, 2024 12:20

use cow

03b8dd2

Merge branch 'main' into sammy/scan-task-in-memory-estimate

0a1cdc8

merged main

6bafa32

samster25 enabled auto-merge (squash) February 27, 2024 00:58

samster25 merged commit cc7b957 into main Feb 27, 2024
27 checks passed

samster25 deleted the sammy/scan-task-in-memory-estimate branch February 27, 2024 01:09
