Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] scan task in memory estimate #1901

Merged
merged 8 commits into from
Feb 27, 2024
Merged

Conversation

samster25
Copy link
Member

@samster25 samster25 commented Feb 20, 2024

  1. When column stats are provided, use only the columns in the materialized schema to estimate in memory size
  • when column stats are missing, fall back on schema estimate for that field
  1. When num_rows is provided, use the materialized schema to estimate in memory size
  2. When neither are provided, estimate the in memory size using an inflation factor (same as our writes) and approximate the number of rows. Then use the materialized schema to estimate in memory size
  3. thread through the new in memory estimator to the ScanWithTask physical op

Copy link

codecov bot commented Feb 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.54%. Comparing base (1a94752) to head (0a1cdc8).

❗ Current head 0a1cdc8 differs from pull request most recent head 6bafa32. Consider uploading reports for the commit 6bafa32 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1901      +/-   ##
==========================================
+ Coverage   83.93%   85.54%   +1.61%     
==========================================
  Files          55       55              
  Lines        6112     6228     +116     
==========================================
+ Hits         5130     5328     +198     
+ Misses        982      900      -82     
Files Coverage Δ
daft/execution/rust_physical_plan_shim.py 93.93% <100.00%> (-1.52%) ⬇️

... and 15 files with indirect coverage changes

Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good, should we write some unit tests for these?

src/daft-core/src/datatypes/dtype.rs Show resolved Hide resolved
src/daft-micropartition/src/micropartition.rs Show resolved Hide resolved
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
@samster25 samster25 enabled auto-merge (squash) February 27, 2024 00:58
@samster25 samster25 merged commit cc7b957 into main Feb 27, 2024
27 checks passed
@samster25 samster25 deleted the sammy/scan-task-in-memory-estimate branch February 27, 2024 01:09
samster25 added a commit that referenced this pull request Feb 27, 2024
* Closes: #1898

1. When column stats are provided, use only the columns in the
materialized schema to estimate in memory size
* when column stats are missing, fall back on schema estimate for that
field
2. When num_rows is provided, use the materialized schema to estimate in
memory size
3. When neither are provided, estimate the in memory size using an
inflation factor (same as our writes) and approximate the number of
rows. Then use the materialized schema to estimate in memory size
4. thread through the new in memory estimator to the ScanWithTask
physical op
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow daft.read_parquet take in custom statistics from the user
2 participants