[PERF] Split parquet scan tasks into individual row groups #1799
Conversation
src/daft-scan/src/glob.rs (Outdated)

vec![DataFileSource::AnonymousDataFile {
    path: path.to_string(),
    chunk_spec: Some(ChunkSpec::Parquet(vec![rg as i64])),
    size_bytes: Some(rgm.compressed_size() as u64),
I interpreted size_bytes as representing the file size, which stores the data compressed, so I'm pretty sure .compressed_size() is the right method; but if it is meant to be the size of the data itself (uncompressed), then we should use .total_byte_size() instead.
See: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-micropartition/src/micropartition.rs#L221
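For reference, this is roughly how the two notions of size differ when inspected from Python with pyarrow. This is only an illustrative sketch (reusing one of the test assets referenced later in this PR), not code from the change itself:

```python
import pyarrow.parquet as pq

# Compare per-row-group sizes: bytes stored in the file (compressed) vs. the
# uncompressed size of the row group's data.
metadata = pq.ParquetFile("tests/assets/parquet-data/mvp.parquet").metadata
for i in range(metadata.num_row_groups):
    rg = metadata.row_group(i)
    # total_byte_size is the uncompressed size of the row group's data,
    # the analogue of total_byte_size() on the Rust side.
    uncompressed = rg.total_byte_size
    # Summing the column chunks' compressed sizes gives the bytes the row
    # group actually occupies in the file, the analogue of compressed_size().
    compressed = sum(
        rg.column(j).total_compressed_size for j in range(rg.num_columns)
    )
    print(f"row group {i}: {compressed} B compressed, {uncompressed} B uncompressed")
```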
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##             main    #1799   +/-   ##
=======================================
  Coverage   85.47%   85.47%
=======================================
  Files          55       55
  Lines        6119     6119
=======================================
  Hits         5230     5230
  Misses        889      889
Conclusion of conversation with @samster25: don't do row-group splitting when using the Python reader, since it is a legacy feature whose performance we don't care about.
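As a rough illustration of the distinction (a sketch assuming the use_native_downloader flag that daft.read_parquet exposed around the time of this PR; not code from this change):

```python
import daft

# Native (Rust) reader: scan tasks can be split into individual row groups.
df_native = daft.read_parquet("tests/assets/parquet-data/mvp.parquet")

# Legacy Python reader: row-group splitting is skipped, since this path is
# kept only for compatibility and is not performance-critical.
df_legacy = daft.read_parquet(
    "tests/assets/parquet-data/mvp.parquet",
    use_native_downloader=False,
)
```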
Also let me know if I should write tests for non-local reads.
import daft

FILES = [
    "tests/assets/parquet-data/mvp.parquet",
We should also add some of the S3 file sources here too. I believe we have all of these files in S3 as well.
I just removed this test and added a fixture to test_reads_public_data.py
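For context, a minimal sketch of what such a fixture could look like; the fixture name, the S3 URI, and the assertion are illustrative placeholders, not the actual contents of test_reads_public_data.py:

```python
import daft
import pytest


# Hypothetical fixture: parametrize tests over a local asset and an S3 copy
# so that both local and remote reads exercise the row-group splitting path.
@pytest.fixture(
    params=[
        "tests/assets/parquet-data/mvp.parquet",
        "s3://my-public-bucket/parquet-data/mvp.parquet",  # placeholder URI
    ]
)
def parquet_path(request):
    return request.param


def test_split_scan_reads_all_rows(parquet_path):
    df = daft.read_parquet(parquet_path)
    # Splitting the scan into row groups should not change what gets read.
    assert df.count_rows() > 0
```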
LGTM overall, just have a few nits and questions about the interaction between splitting and merging scan tasks.
Benchmark results
All times averaged over 5 runs.
Single file read
Read one file in S3 using Ray with 4 worker nodes
Parquet file info:
Multi-file workflow
Read and aggregate (DataFrame.count()) 32 files in S3 using Ray with 4 worker nodes
Parquet file info:
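For reproducibility, here is a sketch of the multi-file benchmark driver described above; the S3 prefix and file names are placeholders, since the actual benchmark files are not listed in this PR:

```python
import time

import daft

# Placeholder paths: 32 parquet files under some S3 prefix.
FILES = [f"s3://my-benchmark-bucket/parquet/part-{i}.parquet" for i in range(32)]

# The benchmark ran on Ray with 4 worker nodes.
daft.context.set_runner_ray()

timings = []
for _ in range(5):
    start = time.perf_counter()
    # Read all 32 files and aggregate with DataFrame.count().
    daft.read_parquet(FILES).count().collect()
    timings.append(time.perf_counter() - start)

print(f"average over 5 runs: {sum(timings) / len(timings):.2f} s")
```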