
Allow daft.read_parquet to take in custom statistics from the user #1898

Closed
raghumdani opened this issue Feb 19, 2024 · 1 comment · Fixed by #1901
Labels
p0 Priority 0 - to be addressed immediately

Comments

@raghumdani

Is your feature request related to a problem? Please describe.
Yes. We have seen a tremendous number of OOMs, largely due to our inability to accurately estimate file sizes when reading. We have devised a tracking system in our catalog that hints at what the in-memory size of a Parquet file could be. Moreover, the execution plan changes the memory requirement in ways that are not easy to estimate.

Describe the solution you'd like
It would be ideal if Daft allowed users to specify an inflation multiplier to apply on top of the size reported in the total_uncompressed_size Parquet metadata, as well as a way to override the exact in-memory size. The reason is that the actual in-memory size can differ substantially from the total_uncompressed_size that Parquet computes, which is compute-agnostic. Similar functionality should also be supported for read_csv. The exact size data is already available in our catalog and is compliant with Iceberg: https://iceberg.apache.org/spec/#manifests.

# use-case 1
daft.read_parquet(file, inflation_multiplier=3.2)

# use-case 2
daft.read_parquet({"name": "s3://file", "stats_override": {"in_memory_bytes_for_compute": 10202929239}})
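
Since the same hints are requested for CSV, a hypothetical analog of use-case 1 might look like the following (this parameter does not exist today; it simply mirrors the proposal above):

# use-case 3 (hypothetical): the same inflation hint applied to CSV reads
daft.read_csv(file, inflation_multiplier=3.2)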

Describe alternatives you've considered

  • Store the custom sizes in the Parquet schema so that Daft can interpret them. The problem is that only Daft would be able to interpret them.
  • Use total_uncompressed_size, which is not accurate.

Additional context
Let me know if this use case makes sense. It would help us avoid OOMs and potentially allow us to leverage the Daft reader directly without having to execute a plan ourselves.

@samster25
Member

samster25 commented Feb 20, 2024

Work-in-progress PR to improve the in-memory estimate (which can be configured): #1901

@jaychia jaychia added this to Daft-OSS Feb 20, 2024
@github-project-automation github-project-automation bot moved this to On Deck in Daft-OSS Feb 20, 2024
@jaychia jaychia added the p0 Priority 0 - to be addressed immediately label Feb 20, 2024
samster25 added a commit that referenced this issue Feb 27, 2024

* Closes: #1898

1. When column stats are provided, use only the columns in the materialized schema to estimate in-memory size; when column stats are missing for a field, fall back on the schema estimate for that field.
2. When num_rows is provided, use the materialized schema to estimate in-memory size.
3. When neither is provided, estimate the in-memory size using an inflation factor (same as our writes) and approximate the number of rows, then use the materialized schema to estimate in-memory size.
4. Thread the new in-memory estimator through to the ScanWithTask physical op.
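
For readers unfamiliar with the fallback order described in the commit above, here is a minimal sketch of it in Python. All names here (estimate_in_memory_size, FileMetadata, the schema representation) are invented for illustration and do not reflect Daft's actual internals:

# Sketch of the three-step estimation cascade. Hypothetical names only.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class FileMetadata:
    column_stats: Optional[Dict[str, int]]  # per-column uncompressed sizes, if known
    num_rows: Optional[int]
    file_size_bytes: int

def estimate_in_memory_size(
    meta: FileMetadata,
    materialized_schema: Dict[str, int],  # column name -> estimated bytes per row
    inflation_factor: float = 3.0,
) -> int:
    # 1. Column stats present: sum the stats for the materialized columns
    #    only, falling back on the per-row schema estimate where a stat is
    #    missing (this sketch needs num_rows for that per-field fallback).
    if meta.column_stats is not None and meta.num_rows is not None:
        total = 0
        for col, bytes_per_row in materialized_schema.items():
            stat = meta.column_stats.get(col)
            total += stat if stat is not None else bytes_per_row * meta.num_rows
        return total
    row_width = sum(materialized_schema.values())
    # 2. Only num_rows present: size the materialized schema by row count.
    if meta.num_rows is not None:
        return meta.num_rows * row_width
    # 3. Neither present: inflate the on-disk size to approximate the row
    #    count, then size the materialized schema as in step 2.
    approx_rows = int(meta.file_size_bytes * inflation_factor) // max(row_width, 1)
    return approx_rows * row_width

# Example: no stats at all, so step 3 inflates 50 MB on disk to ~150 MB.
meta = FileMetadata(column_stats=None, num_rows=None, file_size_bytes=50_000_000)
print(estimate_in_memory_size(meta, {"id": 8, "name": 24}))  # 150000000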
@github-project-automation github-project-automation bot moved this from On Deck to Done in Daft-OSS Feb 27, 2024