-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Streaming physical writes for native executor #2992
Conversation
CodSpeed Performance ReportMerging #2992 will not alter performanceComparing Summary
|
daft/io/writer.py
Outdated
raise NotImplementedError("Subclasses must implement this method.") | ||
|
||
def write(self, table: MicroPartition): | ||
if self.current_writer is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be better here to rely on override-able methods or properties.
so something like if self.current_writer() is None:
Ideally, you can center the logic here and then have the child classes implement the specifics for each file type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I made the methods in FileWriterBase all abstract methods, so the child classes have their own implementation.
let mut current_writer: Option<Box<dyn FileWriter>> = None; | ||
let mut current_file_idx = None; | ||
while let Some((data, file_idx)) = input_receiver.recv().await { | ||
if current_file_idx.is_none() || current_file_idx.unwrap() != file_idx { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed the general pattern of how we are approaching file writes and I think we can improve it with a better abstraction.
So we already have the trait FileWriter
and trait FileWriterFactory
but imagine if we can start layering them.
So at the base we can have ParquetFileWriter
but instead of implementing the row group batching in the executor you can instead implement RowBatcherWriter
that takes in a FileWriterFactory
and then gives you a new factory.
Then you can pass that factory into a TargetFileSizeWriter
that will target the target file size for each file it's writing out. Finally that factory can be passed into a PartitionedWriter
that will partition by value.
This pattern is pretty common for building adaptors for writers can be seen here in the Iceberg sdk
https://github.com/apache/iceberg/blob/2b55fef7cc2a249d864ac26d85a4923313d96a59/core/src/main/java/org/apache/iceberg/io/PartitionedWriter.java
The cool thing is that these Filewriters be be parametered at runtime so the executor can likely be very simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I get it. So PartitionedWriter
is itself a FileWriter
, which holds a WriterFactory
which can generate TargetFileSizeWriter
which is also a FileWriter
and also holds a WriterFactory
which can generate a RowBatchWriter
, which is also a FileWriter
etc etc etc.
Implemented this in latest commit, as a blocking sink.
Table containing metadata about the written file, including path and partition values. | ||
""" | ||
pass | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also have a finalize
method rather than overloading close
to start a next file and closing the last file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was actually intending for these Python writers to be non rotating. i.e. no writing after closing. They should be given a unique file_idx for the file_name generation upon construction, and unique set of partition_values.
I will add assertions and some comments to document this behaviour
pass | ||
|
||
@abstractmethod | ||
def close(self) -> Table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we name this something like start_next_file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mirroring the above comment, the python file writers should not write after close
@@ -1276,5 +1277,27 @@ impl Display for MicroPartition { | |||
} | |||
} | |||
|
|||
impl Bufferable for MicroPartition { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since no one else besides the execution crate is going to use this. What I would recommend is to do this in daft-local-execution
to impl this trait on MicroPartition
.
this is called the newtype pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first i was intending for the targetbatchwriter
to use the bufferable stuff as well, but i think it's simpler to have it implement it's own buffering logic. And let local execution have it's own buffering logic as well.
} | ||
}, | ||
); | ||
match entry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a fan of using the raw pattern outside of the probe table.
Not blocking it for this PR but we should have some kind of way to create Scalars so we can just cram those into the hashmap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would storing the string representation of the partition values be a better alternative for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm but then we'd have to do string conversion every time, so probably not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make an issue for this and tackle this after we create ScalarValue
/// to a separate file. It uses a map to keep track of the writers for each partition. | ||
struct PartitionedWriter { | ||
per_partition_writers: | ||
HashMap<IndexHash, Box<dyn FileWriter<Input = Arc<MicroPartition>, Result = Vec<Table>>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh, we should do something like a ScalarValue
that wraps the possible partition value types for our non performance critical stuff like here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
} | ||
}, | ||
); | ||
match entry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make an issue for this and tackle this after we create ScalarValue
Streaming writes for swordfish (parquet + csv only). Iceberg and delta writes are here: Eventual-Inc#2966 Implement streaming writes as a blocking sink. Unpartitioned writes run with 1 worker, and Partitioned writes run with NUM_CPUs workers. As a drive by, made blocking sinks parallelizable. **Behaviour** - Unpartitioned: Make writes to a `TargetFileSizeWriter`, which manages file sizes and row group sizes, as data is streamed in. - Partitioned: Partition data via a `Dispatcher` and send to workers based on the hash. Each worker runs a `PartitionedWriter` that manages partitioning by value, file sizes, and row group sizes. **Benchmarks:** I made a new benchmark suite in `tests/benchmarks/test_streaming_writes.py`, it tests writes of tpch lineitem to parquet/csv with/without partition columns and different file/rowgroup size. The streaming executor performs much better when there are partition columns, as seen in this screenshot. Without partition columns it is about the same, when target row group size / file size is decreased, it is slightly slower. Likely due to the fact that probably does more slicing, but will need to investigate more. Memory usage is the same for both. <img width="1400" alt="Screenshot 2024-10-03 at 11 22 32 AM" src="https://github.com/user-attachments/assets/53b4d77d-553a-4181-8a4d-9eddaa3adaf7"> Memory test on read->write parquet tpch lineitem sf1: Native: <img width="1078" alt="Screenshot 2024-10-08 at 1 48 34 PM" src="https://github.com/user-attachments/assets/3eda33c6-9413-415f-b808-ac3c7437e269"> Python: <img width="1090" alt="Screenshot 2024-10-08 at 1 48 50 PM" src="https://github.com/user-attachments/assets/f92b9a9f-a3b5-408b-98d5-4ba2d66b7be4"> --------- Co-authored-by: Colin Ho <[email protected]> Co-authored-by: Colin Ho <[email protected]> Co-authored-by: Colin Ho <[email protected]>
Implements streaming Iceberg and Delta writes for swordfish. Most of the write scaffolding has already been implemented in #2992, this PR implements the Iceberg/Delta specific functionalities. A quick TLDR on swordfish writes: - All of the row group sizing, file sizing, partitioning, is now handled in the `daft-writer` crate. - Only the actual writing + flushing is currently handled via Pyarrow Parquet + Csv writers. We intend to build our own native writers in the future. Notes: - Modified the iceberg writes such that: 1. the plan now stores just the spec id + partition cols (we used to keep the whole partitionspec object in the plan but only use the id, maybe we planned on keeping it around for future work, not sure tho pls lmk) 2. I made the `add_missing_columns` stuff an explicit projection. It was a lil cleaner this way instead of having swordfish implement `add_missing_columns` internally. --------- Co-authored-by: Colin Ho <[email protected]> Co-authored-by: Colin Ho <[email protected]>
This PR implements Lance writes for swordfish. The scaffolding for writes was merged in: #2992, and so this one simply adds the lance writes functionality. Co-authored-by: Colin Ho <[email protected]>
Streaming writes for swordfish (parquet + csv only). Iceberg and delta writes are here: #2966
Implement streaming writes as a blocking sink. Unpartitioned writes run with 1 worker, and Partitioned writes run with NUM_CPUs workers. As a drive by, made blocking sinks parallelizable.
Behaviour
Unpartitioned: Make writes to a
TargetFileSizeWriter
, which manages file sizes and row group sizes, as data is streamed in.Partitioned: Partition data via a
Dispatcher
and send to workers based on the hash. Each worker runs aPartitionedWriter
that manages partitioning by value, file sizes, and row group sizes.Benchmarks:
I made a new benchmark suite in
tests/benchmarks/test_streaming_writes.py
, it tests writes of tpch lineitem to parquet/csv with/without partition columns and different file/rowgroup size. The streaming executor performs much better when there are partition columns, as seen in this screenshot. Without partition columns it is about the same, when target row group size / file size is decreased, it is slightly slower. Likely due to the fact that probably does more slicing, but will need to investigate more. Memory usage is the same for both.Memory test on read->write parquet tpch lineitem sf1:
Native:
Python: