
Allow daft.read_parquet to take in custom statistics from the user #1898

Closed
raghumdani opened this issue Feb 19, 2024 · 1 comment · Fixed by #1901
Labels
p0 Priority 0 - to be addressed immediately

Comments

@raghumdani

Is your feature request related to a problem? Please describe.
Yes. We have seen a tremendous number of OOMs, largely due to our inability to accurately estimate file sizes when reading. We have devised a tracking system in our catalog that hints at what the in-memory size of a Parquet file could be. Moreover, the execution plan changes the memory requirement in ways that are not easy to estimate.

Describe the solution you'd like
It would be ideal if Daft allowed users to specify an inflation multiplier to apply on top of the size reported in the total_uncompressed_size Parquet metadata, as well as a way to override the exact in-memory size. The reason is that the actual in-memory size can differ substantially from the total_uncompressed_size that Parquet computes, which is compute-agnostic. Similar functionality should also be supported for read_csv. The exact size data is already available in our catalog and is compliant with Iceberg: https://iceberg.apache.org/spec/#manifests.

# use-case 1
daft.read_parquet(file, inflation_multiplier=3.2)

# use-case 2
daft.read_parquet({"name": "s3://file", "stats_override": {"in_memory_bytes_for_compute": 10202929239}})
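
Since the same hints are requested for CSV, a hypothetical analog of use-case 1 might look like the following (this parameter does not exist today; it simply mirrors the proposal above):

# use-case 3 (hypothetical): the same inflation hint applied to CSV reads
daft.read_csv(file, inflation_multiplier=3.2)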

Describe alternatives you've considered

  • Store the custom sizes in the Parquet schema so that Daft can interpret them. The problem is that only Daft would be able to interpret them.
  • Use total_uncompressed_size, which is not accurate.

Additional context
Let me know if this use case makes sense. It would help us avoid OOMs and potentially allow us to leverage the Daft reader directly without having to execute a plan ourselves.

@samster25
Member

samster25 commented Feb 20, 2024

Work-in-progress PR to improve the in-memory estimate (which can be configured): #1901

@jaychia jaychia added this to Daft-OSS Feb 20, 2024
@github-project-automation github-project-automation bot moved this to On Deck in Daft-OSS Feb 20, 2024
@jaychia jaychia added the p0 Priority 0 - to be addressed immediately label Feb 20, 2024
samster25 added a commit that referenced this issue Feb 27, 2024

* Closes: #1898

1. When column stats are provided, use only the columns in the materialized schema to estimate in-memory size; when column stats are missing for a field, fall back on the schema estimate for that field.
2. When num_rows is provided, use the materialized schema to estimate in-memory size.
3. When neither is provided, estimate the in-memory size using an inflation factor (same as our writes) and approximate the number of rows, then use the materialized schema to estimate in-memory size.
4. Thread the new in-memory estimator through to the ScanWithTask physical op.
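
For readers unfamiliar with the fallback order described in the commit above, here is a minimal sketch of it in Python. All names here (estimate_in_memory_size, FileMetadata, the schema representation) are invented for illustration and do not reflect Daft's actual internals:

# Sketch of the three-step estimation cascade. Hypothetical names only.
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class FileMetadata:
    column_stats: Optional[Dict[str, int]]  # per-column uncompressed sizes, if known
    num_rows: Optional[int]
    file_size_bytes: int

def estimate_in_memory_size(
    meta: FileMetadata,
    materialized_schema: Dict[str, int],  # column name -> estimated bytes per row
    inflation_factor: float = 3.0,
) -> int:
    # 1. Column stats present: sum the stats for the materialized columns
    #    only, falling back on the per-row schema estimate where a stat is
    #    missing (this sketch needs num_rows for that per-field fallback).
    if meta.column_stats is not None and meta.num_rows is not None:
        total = 0
        for col, bytes_per_row in materialized_schema.items():
            stat = meta.column_stats.get(col)
            total += stat if stat is not None else bytes_per_row * meta.num_rows
        return total
    row_width = sum(materialized_schema.values())
    # 2. Only num_rows present: size the materialized schema by row count.
    if meta.num_rows is not None:
        return meta.num_rows * row_width
    # 3. Neither present: inflate the on-disk size to approximate the row
    #    count, then size the materialized schema as in step 2.
    approx_rows = int(meta.file_size_bytes * inflation_factor) // max(row_width, 1)
    return approx_rows * row_width

# Example: no stats at all, so step 3 inflates 50 MB on disk to ~150 MB.
meta = FileMetadata(column_stats=None, num_rows=None, file_size_bytes=50_000_000)
print(estimate_in_memory_size(meta, {"id": 8, "name": 24}))  # 150000000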
@github-project-automation github-project-automation bot moved this from On Deck to Done in Daft-OSS Feb 27, 2024