
[FEAT] Streaming Catalog Writes #2966

Closed
colin-ho wants to merge 24 commits into main from colin/streaming-catalog-writes

Conversation

colin-ho
Contributor

No description provided.

@github-actions github-actions bot added the enhancement New feature or request label Sep 28, 2024

codspeed-hq bot commented Sep 30, 2024

CodSpeed Performance Report

Merging #2966 will not alter performance

Comparing colin/streaming-catalog-writes (f835c47) with main (fe4553f)

Summary

✅ 17 untouched benchmarks


codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 79.95578% with 272 lines in your changes missing coverage. Please review.

Project coverage is 78.34%. Comparing base (b2dabf6) to head (f835c47).
Report is 86 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| daft/io/writer.py | 0.00% | 163 Missing ⚠️ |
| ...-local-execution/src/writes/unpartitioned_write.rs | 85.05% | 26 Missing ⚠️ |
| ...ft-local-execution/src/writes/partitioned_write.rs | 89.47% | 24 Missing ⚠️ |
| src/daft-local-execution/src/buffer.rs | 76.19% | 15 Missing ⚠️ |
| src/daft-local-execution/src/sources/scan_task.rs | 79.36% | 13 Missing ⚠️ |
| src/daft-parquet/src/stream_reader.rs | 45.83% | 13 Missing ⚠️ |
| src/daft-local-execution/src/pipeline.rs | 95.42% | 7 Missing ⚠️ |
| src/daft-micropartition/src/lib.rs | 90.62% | 6 Missing ⚠️ |
| ...-execution/src/intermediate_ops/intermediate_op.rs | 66.66% | 2 Missing ⚠️ |
| src/daft-micropartition/src/py_writers.rs | 99.42% | 1 Missing ⚠️ |
| ... and 2 more | | |
Additional details and impacted files


```diff
@@            Coverage Diff             @@
##             main    #2966      +/-   ##
==========================================
+ Coverage   78.22%   78.34%   +0.11%     
==========================================
  Files         598      609      +11     
  Lines       70556    72222    +1666     
==========================================
+ Hits        55194    56583    +1389     
- Misses      15362    15639     +277     
```

| Files with missing lines | Coverage Δ |
|---|---|
| daft/logical/builder.py | 89.87% <100.00%> (+0.26%) ⬆️ |
| daft/table/table_io.py | 85.96% <ø> (ø) |
| src/daft-local-execution/src/lib.rs | 90.47% <100.00%> (+0.73%) ⬆️ |
| src/daft-local-execution/src/run.rs | 89.84% <100.00%> (ø) |
| ...daft-local-execution/src/writes/deltalake_write.rs | 100.00% <100.00%> (ø) |
| ...c/daft-local-execution/src/writes/iceberg_write.rs | 100.00% <100.00%> (ø) |
| .../daft-local-execution/src/writes/physical_write.rs | 100.00% <100.00%> (ø) |
| src/daft-parquet/src/file.rs | 74.01% <100.00%> (+0.10%) ⬆️ |
| src/daft-physical-plan/src/local_plan.rs | 89.06% <100.00%> (+3.34%) ⬆️ |
| src/daft-plan/src/builder.rs | 81.87% <100.00%> (-11.07%) ⬇️ |
| ... and 14 more | |

... and 34 files with indirect coverage changes

@samster25 samster25 changed the base branch from main to colin/streaming-physical-writes October 22, 2024 00:46
@samster25 samster25 changed the base branch from colin/streaming-physical-writes to main October 22, 2024 00:47
colin-ho added a commit that referenced this pull request Oct 31, 2024
Streaming writes for swordfish (Parquet + CSV only). Iceberg and Delta writes are here: #2966

Implement streaming writes as a blocking sink. Unpartitioned writes run with 1 worker, and partitioned writes run with NUM_CPUs workers. As a drive-by, this also makes blocking sinks parallelizable.
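
A minimal sketch of that call pattern (illustrative names and signatures only; this is not Daft's actual `BlockingSink` trait): a sink advertises how many workers may feed it concurrently, consumes batches as they stream in, and only produces output when finalized.

```rust
// Hedged sketch of a blocking sink's call pattern; names and signatures are
// illustrative, not Daft's real API.
trait BlockingSink {
    /// Consume one incoming batch on the given worker.
    fn sink(&mut self, worker_id: usize, batch: Vec<i64>);
    /// Called once all input is exhausted; returns the sink's final output.
    fn finalize(&mut self) -> Vec<i64>;
    /// How many workers may call `sink` concurrently:
    /// 1 for unpartitioned writes, the number of CPUs for partitioned writes.
    fn max_concurrency(&self) -> usize;
}

/// Toy sink that counts rows per worker, just to show the shape.
struct CountingSink {
    counts: Vec<usize>,
}

impl BlockingSink for CountingSink {
    fn sink(&mut self, worker_id: usize, batch: Vec<i64>) {
        self.counts[worker_id] += batch.len();
    }
    fn finalize(&mut self) -> Vec<i64> {
        self.counts.iter().map(|c| *c as i64).collect()
    }
    fn max_concurrency(&self) -> usize {
        self.counts.len()
    }
}

fn main() {
    let mut sink = CountingSink { counts: vec![0; 4] };
    println!("workers: {}", sink.max_concurrency());
    // Round-robin eight batches of 100 rows across the four workers.
    for (worker, batch) in (0..8usize).map(|i| (i % 4, vec![0i64; 100])) {
        sink.sink(worker, batch);
    }
    println!("rows per worker: {:?}", sink.finalize());
}
```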

**Behaviour**
- Unpartitioned: writes go to a `TargetFileSizeWriter`, which manages file sizes and row group sizes as data is streamed in.

- Partitioned: data is partitioned via a `Dispatcher` and sent to workers based on the hash of the partition values. Each worker runs a `PartitionedWriter` that manages partitioning by value, file sizes, and row group sizes (a simplified sketch follows below).
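
To make the partitioned path concrete, here is a hedged, self-contained sketch (the `worker_for` helper and the row-count threshold are illustrative stand-ins, not Daft's internals): a dispatcher routes each row to a worker by hashing its partition key, and each worker's writer buffers rows per partition value, closing a file whenever a buffer reaches the target size.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for a target file size; the real writers target bytes/row groups.
const TARGET_ROWS_PER_FILE: usize = 1000;

/// Dispatcher: pick a worker from the hash of the partition key.
fn worker_for(partition_key: &str, num_workers: usize) -> usize {
    let mut h = DefaultHasher::new();
    partition_key.hash(&mut h);
    (h.finish() as usize) % num_workers
}

/// Per-worker writer: buffers rows per partition value and "closes a file"
/// whenever a partition's buffer reaches the target size.
#[derive(Default)]
struct PartitionedWriter {
    buffers: HashMap<String, Vec<String>>,
    files_written: usize,
}

impl PartitionedWriter {
    fn write(&mut self, partition_key: &str, row: String) {
        let buf = self.buffers.entry(partition_key.to_string()).or_default();
        buf.push(row);
        if buf.len() >= TARGET_ROWS_PER_FILE {
            // The real writer would flush a Parquet/CSV file here.
            self.files_written += 1;
            buf.clear();
        }
    }

    fn finalize(&mut self) -> usize {
        // Flush any remaining non-empty buffers as final files.
        self.files_written += self.buffers.values().filter(|b| !b.is_empty()).count();
        self.files_written
    }
}

fn main() {
    let num_workers: usize = 4;
    let mut workers: Vec<PartitionedWriter> =
        (0..num_workers).map(|_| PartitionedWriter::default()).collect();

    // Stream rows in; the dispatcher hashes the partition key to pick a worker.
    for i in 0..5000usize {
        let key = format!("region-{}", i % 7);
        let w = worker_for(&key, num_workers);
        workers[w].write(&key, format!("row-{i}"));
    }

    let total_files: usize = workers.iter_mut().map(|w| w.finalize()).sum();
    println!("wrote {total_files} files across {num_workers} workers");
}
```

Hashing on the partition key means all rows of a given partition value land on the same worker, so files never need to be merged across workers.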


**Benchmarks:**
I made a new benchmark suite in `tests/benchmarks/test_streaming_writes.py`. It tests writes of TPC-H lineitem to Parquet/CSV, with and without partition columns, and with different target file/row group sizes. The streaming executor performs much better when there are partition columns, as seen in this screenshot. Without partition columns it is about the same; when the target row group size / file size is decreased it is slightly slower, likely because it does more slicing, but this needs more investigation. Memory usage is the same for both.
<img width="1400" alt="Screenshot 2024-10-03 at 11 22 32 AM"
src="https://github.com/user-attachments/assets/53b4d77d-553a-4181-8a4d-9eddaa3adaf7">

Memory test on a read->write of TPC-H lineitem SF1 Parquet:
Native:
<img width="1078" alt="Screenshot 2024-10-08 at 1 48 34 PM"
src="https://github.com/user-attachments/assets/3eda33c6-9413-415f-b808-ac3c7437e269">

Python:
<img width="1090" alt="Screenshot 2024-10-08 at 1 48 50 PM"
src="https://github.com/user-attachments/assets/f92b9a9f-a3b5-408b-98d5-4ba2d66b7be4">

---------

Co-authored-by: Colin Ho <[email protected]>
Co-authored-by: Colin Ho <[email protected]>
Co-authored-by: Colin Ho <[email protected]>
@colin-ho colin-ho closed this Oct 31, 2024
sagiahrac pushed a commit to sagiahrac/Daft that referenced this pull request Nov 4, 2024