Allow daft.read_parquet to take in custom statistics from the user #1898
Labels: p0 (Priority 0 - to be addressed immediately)
Comments
Work-in-progress PR to improve the in-memory estimate (which can be configured)
samster25 added a commit that referenced this issue on Feb 27, 2024:
* Closes: #1898

1. When column stats are provided, use only the columns in the materialized schema to estimate the in-memory size; when column stats are missing for a field, fall back on the schema estimate for that field.
2. When num_rows is provided, use the materialized schema to estimate the in-memory size.
3. When neither is provided, estimate the in-memory size using an inflation factor (the same as our writes) and approximate the number of rows, then use the materialized schema to estimate the in-memory size.
4. Thread the new in-memory estimator through to the ScanWithTask physical op.
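Below is a minimal sketch of the fallback order described in the commit message above. All names (ColumnStats, estimate_in_memory_size, the schema representation) and the default inflation factor are assumptions for illustration, not Daft's actual internals.

```python
from dataclasses import dataclass
from typing import Optional

INFLATION_FACTOR = 3.0  # assumed value; Daft derives its factor from the write path


@dataclass
class ColumnStats:
    in_memory_size: Optional[int]  # bytes for the whole column; may be absent


def estimate_in_memory_size(
    materialized_schema: dict[str, int],             # column -> estimated bytes per row
    column_stats: Optional[dict[str, ColumnStats]],  # per-column stats, if provided
    num_rows: Optional[int],
    file_size_on_disk: int,
) -> int:
    # Case 3: neither column stats nor a row count. Inflate the on-disk size
    # and back out an approximate row count from the schema estimate.
    if num_rows is None:
        bytes_per_row = max(sum(materialized_schema.values()), 1)
        num_rows = int(file_size_on_disk * INFLATION_FACTOR / bytes_per_row)

    total = 0
    for col, per_row_bytes in materialized_schema.items():
        stats = column_stats.get(col) if column_stats else None
        if stats is not None and stats.in_memory_size is not None:
            # Case 1: exact per-column stats take precedence.
            total += stats.in_memory_size
        else:
            # Missing stats for this field (case 1 fallback) or case 2:
            # schema-based estimate scaled by the row count.
            total += per_row_bytes * num_rows
    return total
```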
Is your feature request related to a problem? Please describe.
Yes. We have seen a tremendous number of OOMs, largely because we cannot accurately estimate in-memory sizes when reading files. We have built a tracking system in our catalog that provides hints about what the in-memory size of a Parquet file could be. Moreover, the execution plan changes the memory requirement in ways that are not easy to estimate.
Describe the solution you'd like
It would be ideal if Daft allowed users to specify a hint for the inflation factor to apply on top of the size recorded in the total_uncompressed_size Parquet metadata, or to override the exact in-memory size outright. The reason is that the actual in-memory size can differ significantly from the total_uncompressed_size that Parquet calculates, which is compute-agnostic. Similar functionality should be supported for read_csv as well; a hypothetical sketch of what such an API could look like is shown below. The exact size data is already available in our catalog and is compliant with Iceberg: https://iceberg.apache.org/spec/#manifests
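The following is only an illustration of the request; neither the inflation_factor nor the size_hints parameter exists in daft.read_parquet today.

```python
import daft

# Hypothetical: apply a user-supplied inflation factor on top of the
# total_uncompressed_size recorded in the Parquet metadata.
df = daft.read_parquet(
    "s3://bucket/table/*.parquet",
    inflation_factor=4.0,
)

# Hypothetical: override the estimate with exact in-memory sizes tracked in
# an external (e.g. Iceberg-compliant) catalog, keyed by file path.
df = daft.read_parquet(
    "s3://bucket/table/*.parquet",
    size_hints={"s3://bucket/table/part-00000.parquet": 512 * 1024**2},
)
```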
Describe alternatives you've considered
Additional context
Let me know if this use case makes sense, as it would avoid OOMs and potentially allow us to leverage the Daft reader without having to perform the execution of a plan ourselves.