
[FEAT] Add streaming + parallel CSV reader, with decompression support. #1501

Merged 12 commits from clark/streaming-parallel-csv-read into main on Oct 20, 2023

Conversation

@clarkzinzow (Contributor) commented Oct 18, 2023

This PR adds streaming + parallel CSV reading and parsing, along with support for streaming decompression. In particular, this PR:

  • Adds support for streaming decompression for brotli, bz, deflate, gzip, lzma, xz, zlib, and zstd.
  • Performs chunk-based streaming CSV reads, filling up a small buffer of unparsed records.
  • Pipelines chunk-based CSV parsing with reading by spawning Tokio + rayon parsing tasks.
  • Performs chunk parsing, as well as column parsing within a chunk, in parallel on the rayon threadpool.
  • Changes schema inference to involve an (at most) 1 MiB file peek rather than a full file read.
  • Gathers an estimate of the mean row size in bytes during schema inference and propagates this estimate back to the reader.
  • Unifies local and cloud reads + schema inference.
  • Adds thorough Rust-side local + cloud test coverage.

The streaming + parallel reading and parsing leads to a 4-8x speedup over the pyarrow reader and the previous non-parallel reader when benchmarking large-file (~1 GB) reads, while also resulting in lower memory utilization due to the streaming reads and parsing.
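
As a rough illustration of the decompression selection and read/parse pipelining described above (not the PR's actual implementation), the sketch below assumes the tokio, rayon, and async-compression crates with the relevant features enabled. The function names, the toy newline/comma splitting, and the unordered output collection are placeholders; the real reader aligns chunks to record boundaries, preserves chunk order, and produces Arrow tables rather than strings.

use async_compression::tokio::bufread::GzipDecoder;
use tokio::io::{AsyncBufRead, AsyncRead, AsyncReadExt};
use tokio::sync::mpsc;

/// Toy parsed-chunk type; the actual reader produces Arrow record batches.
type ParsedChunk = Vec<Vec<String>>;

/// Wrap a byte stream in a streaming decompressor chosen from the file
/// extension (only gzip shown; the PR also covers brotli, bz, deflate,
/// lzma, xz, zlib, and zstd).
fn maybe_decompress<R: AsyncBufRead + Unpin + Send + 'static>(
    reader: R,
    path: &str,
) -> Box<dyn AsyncRead + Unpin + Send> {
    if path.ends_with(".gz") {
        Box::new(GzipDecoder::new(reader))
    } else {
        Box::new(reader)
    }
}

/// Stream fixed-size chunks off the (possibly decompressed) reader and parse
/// each chunk on the rayon threadpool, so I/O and parsing overlap. A small
/// bounded channel plays the role of the buffer of unparsed records.
async fn read_and_parse(
    mut reader: impl AsyncRead + Unpin + Send + 'static,
    chunk_size: usize,
) -> Vec<ParsedChunk> {
    let (tx, mut rx) = mpsc::channel::<ParsedChunk>(8);
    tokio::spawn(async move {
        loop {
            let mut buf = vec![0u8; chunk_size];
            // NOTE: a real implementation must extend each chunk to the next
            // record boundary so rows aren't split across chunks.
            let n = match reader.read(&mut buf).await {
                Ok(0) | Err(_) => break,
                Ok(n) => n,
            };
            buf.truncate(n);
            let tx = tx.clone();
            // CPU-bound parsing happens off the async runtime, on rayon.
            rayon::spawn(move || {
                let parsed: ParsedChunk = buf
                    .split(|b| *b == b'\n')
                    .filter(|line| !line.is_empty())
                    .map(|line| {
                        String::from_utf8_lossy(line)
                            .split(',')
                            .map(str::to_string)
                            .collect()
                    })
                    .collect();
                let _ = tx.blocking_send(parsed);
            });
        }
    });
    // Chunks arrive here in completion order; the real reader re-orders them.
    let mut chunks = Vec::new();
    while let Some(chunk) = rx.recv().await {
        chunks.push(chunk);
    }
    chunks
}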

TODOs (follow-up PRs)

  • Add snappy decompression support (need to essentially do something like this)

@github-actions bot added the enhancement label on Oct 18, 2023
@clarkzinzow changed the title from "[FEAT] Add streaming + parallel CSV reader." to "[FEAT] Add streaming + parallel CSV reader, with decompression support." on Oct 18, 2023

codecov bot commented Oct 18, 2023

Codecov Report

Merging #1501 (df97d66) into main (bdd2128) will increase coverage by 0.01%.
Report is 1 commit behind head on main.
The diff coverage is 100.00%.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1501      +/-   ##
==========================================
+ Coverage   74.74%   74.76%   +0.01%     
==========================================
  Files          60       60              
  Lines        6118     6130      +12     
==========================================
+ Hits         4573     4583      +10     
- Misses       1545     1547       +2     
Files                              Coverage Δ
daft/execution/execution_step.py   92.82% <ø> (ø)
daft/io/_csv.py                    95.00% <100.00%> (ø)
daft/runners/partitioning.py       80.70% <100.00%> (+0.34%) ⬆️
daft/table/table.py                81.94% <ø> (ø)
daft/table/table_io.py             95.83% <ø> (-0.70%) ⬇️

... and 7 files with indirect coverage changes

@clarkzinzow force-pushed the clark/streaming-parallel-csv-read branch from c29cbbe to 46ec9ef on October 18, 2023 20:03
@samster25 (Member) left a comment:

Looks good! And you were right haha, I think we should probably pull out the estimated rows commit until after we do the MicroPartition and scan operator work.

daft/table/schema_inference.py (review thread resolved)
src/daft-csv/src/metadata.rs (review thread resolved)
// default to Utf8 for conflicting datatypes (e.g. bool and int)
DataType::Utf8
}
}
@samster25 (Member): Is there any merging we have to do for the temporal types?

@clarkzinzow (Contributor, Author): I'm assuming that we'll have a variety of follow-ups there, yes.
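
For context on the Utf8 fallback in the snippet above, here is a minimal sketch (not the PR's actual logic) of what per-column dtype merging and value-level inference might look like, assuming arrow2-style DataType variants; the function names and the specific widening rules are illustrative. In the PR, this inference runs over an at-most-1 MiB peek of the file rather than a full read.

use arrow2::datatypes::DataType;

/// Illustrative per-column dtype merge for schema inference.
fn merge_inferred_dtypes(left: &DataType, right: &DataType) -> DataType {
    match (left, right) {
        (a, b) if a == b => a.clone(),
        // Widen mixed integer/float columns to Float64.
        (DataType::Int64, DataType::Float64) | (DataType::Float64, DataType::Int64) => {
            DataType::Float64
        }
        // A null-only sample doesn't constrain the other side.
        (DataType::Null, other) | (other, DataType::Null) => other.clone(),
        // Everything else (e.g. Boolean vs. Int64, or mismatched temporal
        // types as raised in the review) falls back to Utf8 for now.
        _ => DataType::Utf8,
    }
}

/// Toy value-level inference; a real implementation also handles temporal
/// formats, null sentinels, etc.
fn infer_value_dtype(value: &str) -> DataType {
    if value.is_empty() {
        DataType::Null
    } else if value.parse::<i64>().is_ok() {
        DataType::Int64
    } else if value.parse::<f64>().is_ok() {
        DataType::Float64
    } else if matches!(value, "true" | "false") {
        DataType::Boolean
    } else {
        DataType::Utf8
    }
}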

schema,
// Default buffer size of 512 KiB.
buffer_size.unwrap_or(512 * 1024),
// Default chunk size of 64 KiB.
@samster25 (Member): Do these constants make sense given it's a local buffered file?

@clarkzinzow (Contributor, Author): I have a TODO to benchmark locally and tweak these, but I'm assuming that tweaking these won't matter as much for local reads as they do for cloud reads. I can look into tweaking these if you'd like!
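
As a side note on the defaults being discussed, one way to keep them easy to tune later is to hoist them into named constants with optional per-read overrides; a small sketch follows (the constant and struct names are illustrative, not the PR's actual API).

/// Illustrative defaults mirroring the snippet above; tuning them
/// (especially for local buffered files) is left as a follow-up benchmark.
const DEFAULT_BUFFER_SIZE: usize = 512 * 1024; // bytes of unparsed records to buffer
const DEFAULT_CHUNK_SIZE: usize = 64 * 1024; // bytes handed to each parse task

struct CsvReadOptions {
    buffer_size: Option<usize>,
    chunk_size: Option<usize>,
}

impl CsvReadOptions {
    /// Resolve user overrides against the defaults.
    fn resolved(&self) -> (usize, usize) {
        (
            self.buffer_size.unwrap_or(DEFAULT_BUFFER_SIZE),
            self.chunk_size.unwrap_or(DEFAULT_CHUNK_SIZE),
        )
    }
}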

src/daft-csv/src/read.rs (review thread resolved)
@clarkzinzow (Contributor, Author) commented:

@samster25 I reverted the estimated row size piping from schema inference, added a good bit more test coverage, and addressed your primary review comments, PTAL!

@samster25 (Member) left a comment:

Looks good! However, we should also test everything through the Python side via DataFrame tests, if that isn't already done!

src/daft-csv/src/read.rs (review thread resolved)
src/daft-csv/src/read.rs (review thread resolved)
daft/runners/partitioning.py (review thread resolved)
@clarkzinzow merged commit ad829c9 into main on Oct 20, 2023
24 checks passed
@clarkzinzow deleted the clark/streaming-parallel-csv-read branch on October 20, 2023 01:52