[PERF] Add a parallel local CSV reader #3055
Conversation
CodSpeed Performance Report
Merging #3055 will not alter performance.
Great work @desmondcheongzx!! Finally through :)
Just some minor feedback and questions, but this should be good to merge afterwards.
include_columns: Option<Vec<String>>,
predicate: Option<Arc<Expr>>,
limit: Option<usize>,
) -> DaftResult<Vec<Table>>
I think this could technically return an Iterator of Tables, so we can start emitting smaller chunks of tables while we're still processing the Chunk. But I'm going to say it's out of scope for this PR 🥹
Haha, I did give it a shot and got a table iterator. What I didn't get to was "start emitting smaller chunks of tables while we're still processing the Chunk", so it kinda just sat around until someone called .next() on it. 2x slowdown. I think you're right, let's punt it.
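For reference, a minimal sketch of the streaming shape discussed above (the read_csv_streaming name and the stand-in types are hypothetical, not Daft's actual API): workers decode chunks in parallel and push each finished table through a channel, and the caller drains the receiver as an iterator.

use std::sync::mpsc;
use std::thread;

// Stand-ins for Daft's types, just to keep the sketch self-contained.
type Table = Vec<String>;
type DaftResult<T> = Result<T, String>;

// Hypothetical: decode chunks on worker threads and stream each finished
// Table through a channel. The caller gets an iterator and can start
// consuming tables before the whole file has been parsed.
fn read_csv_streaming(chunks: Vec<Vec<u8>>) -> impl Iterator<Item = DaftResult<Table>> {
    let (tx, rx) = mpsc::channel();
    for chunk in chunks {
        let tx = tx.clone();
        thread::spawn(move || {
            // Pretend this decodes one CSV chunk into a Table.
            let table: Table = chunk
                .split(|&b| b == b'\n')
                .map(|line| String::from_utf8_lossy(line).into_owned())
                .collect();
            let _ = tx.send(Ok(table));
        });
    }
    drop(tx); // receiver sees end-of-stream once all workers finish
    rx.into_iter()
}

The "2x slowdown" observation matches this shape: if nothing eagerly drains the receiver, decoded tables just queue up until someone calls .next().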
}
}

const NEWLINE: u8 = b'\n';
Should these be user configurable?
As discussed offline, Daft's read_csv API currently does not accept non-\n record terminators. I added a TODO comment as a reminder that we should add this option to the API and pass it down here.
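As a rough illustration, the option could eventually be threaded through a parse-options struct along these lines (ParseOptions and its field names are assumptions for illustration, not Daft's actual API):

// Hypothetical ParseOptions; field names are assumptions, not Daft's API.
#[derive(Clone)]
struct ParseOptions {
    // Byte that terminates a record; today this is hardcoded to b'\n'.
    record_terminator: u8,
    // Field delimiter; b',' by default.
    delimiter: u8,
}

impl Default for ParseOptions {
    fn default() -> Self {
        Self { record_terminator: b'\n', delimiter: b',' }
    }
}

// The chunk-alignment and parsing code would then consult the option
// instead of the hardcoded NEWLINE constant.
fn is_record_end(byte: u8, opts: &ParseOptions) -> bool {
    byte == opts.record_terminator
}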
The local CSV reader currently makes upfront buffer allocations (80 MiB for file slabs and 80 MiB for CSV buffers). This unnecessarily inflates read times for small CSV files, which don't need that many buffers. Since the local CSV reader already allocates additional buffers as needed, we can remove all upfront allocations without affecting anything else in the implementation. This speeds up reads of small files. At the same time, I benchmarked the reader against the test case described in #3055 and found no consistent slowdown without the upfront allocations.
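For illustration, a grow-on-demand pool along these lines gives the same buffer reuse without the upfront cost (a sketch, not the reader's actual buffer pool):

use std::sync::Mutex;

// Sketch of a grow-on-demand buffer pool: nothing is allocated up front;
// a buffer is created only when a caller asks for one and none is free,
// and returned buffers are recycled for later reads.
struct BufferPool {
    free: Mutex<Vec<Vec<u8>>>,
    buffer_size: usize,
}

impl BufferPool {
    fn new(buffer_size: usize) -> Self {
        Self { free: Mutex::new(Vec::new()), buffer_size }
    }

    fn get(&self) -> Vec<u8> {
        self.free
            .lock()
            .unwrap()
            .pop()
            // Lazy allocation: small files that never need many buffers
            // never pay for them.
            .unwrap_or_else(|| vec![0u8; self.buffer_size])
    }

    fn put(&self, buf: Vec<u8>) {
        self.free.lock().unwrap().push(buf);
    }
}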
Adds a parallel CSV reader to speed up CSV ingestion. This reader is used for local, uncompressed CSV files.
The approach in this PR adapts some ideas laid out in [1], namely adjusting each chunk of the CSV file using its neighbouring chunks so that every chunk contains only whole CSV records, which can then be decoded in parallel (a simplified sketch follows below).
However, as it turns out, the majority of the performance gains came from using buffer pools to minimize memory allocations.
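To make the chunk-alignment idea concrete, here is a simplified sketch. It assumes no quoted field contains an embedded newline, which is the hard case that [1] handles speculatively: split the file at fixed byte offsets, then push each boundary forward to the next record terminator so no record is split.

// Simplified chunk alignment: cut the file at fixed offsets, then extend
// each tentative boundary to the next newline so every chunk holds only
// whole CSV records. Assumes no embedded newlines inside quoted fields.
fn align_chunks(data: &[u8], target_chunk_size: usize) -> Vec<&[u8]> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let tentative_end = (start + target_chunk_size).min(data.len());
        // Scan forward from the tentative boundary to the next newline
        // (or the end of the file) so we never split a record.
        let end = data[tentative_end..]
            .iter()
            .position(|&b| b == b'\n')
            .map(|pos| tentative_end + pos + 1)
            .unwrap_or(data.len());
        chunks.push(&data[start..end]);
        start = end;
    }
    chunks
}

Each aligned chunk can then be handed to a separate thread and parsed independently.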
Some performance numbers
We consider a simple case of reading and performing .collect() on a CSV file with 10^8 rows of 9 fields: 3 string fields, 5 int64 fields, and 1 double field. The file is roughly 5 GB in size. The new reader represents a roughly 12x speedup on CSV reads for the native executor.
Followups
Unfortunately, with this new CSV reader, the native executor no longer shows a small, stable memory footprint during an aggregation. Memray shows a steady increase in resident set size that disappears once the aggregation completes. For what it's worth, parsing the whole CSV file with this reader and then dumping the results does not show any memory increase, so it's possible we're simply not passing results to downstream consumers in the way they expect.
[1]: Ge, Chang et al. “Speculative Distributed CSV Data Parsing for Big Data Analytics.” Proceedings of the 2019 International Conference on Management of Data (2019).