
[PERF] Split parquet scan tasks into individual row groups #1799

Merged

kevinzwang merged 15 commits into main from kevin/split-parquet-row-groups on Feb 8, 2024

Conversation

@kevinzwang (Member) commented on Jan 19, 2024:

Benchmark results

All times averaged over 5 runs.

Single file read

Read one file in S3 using Ray with 4 worker nodes

Parquet file info:

  • number of rows: 18,751,674
  • number of row groups: 18
  • file size: 711.3 MiB
  • number of columns: 16
| Split threshold (MiB) | # of scan tasks | Read time |
| --- | --- | --- |
| 32 | 18 | 3.84s |
| 64 | 9 | 4.21s |
| 128 | 5 | 5.47s |
| 256 | 3 | 4.45s |
| 512 | 2 | 6.06s |
| 1024 | 1 | 10.51s |
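The scan-task counts above fall out of grouping consecutive row groups until their combined size reaches the split threshold. Below is a minimal sketch of that grouping logic, written purely for illustration (it is not the PR's actual Rust implementation); the benchmarked file is modeled as 18 row groups of roughly 40 MiB each.

```python
MIB = 1024 * 1024

def split_into_scan_tasks(row_group_sizes: list[int], threshold_bytes: int) -> list[list[int]]:
    """Group consecutive row groups into scan tasks of at least `threshold_bytes` (sketch only)."""
    tasks: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for rg_index, size in enumerate(row_group_sizes):
        current.append(rg_index)
        current_size += size
        if current_size >= threshold_bytes:  # close out the current scan task
            tasks.append(current)
            current, current_size = [], 0
    if current:  # leftover row groups form the final task
        tasks.append(current)
    return tasks

# The benchmarked file: ~711 MiB spread across 18 row groups (~40 MiB each).
row_groups = [40 * MIB] * 18
for threshold in (32, 64, 128, 256, 512, 1024):
    n_tasks = len(split_into_scan_tasks(row_groups, threshold * MIB))
    print(f"{threshold} MiB threshold -> {n_tasks} scan tasks")
```

This reproduces the 18/9/5/3/2/1 task counts in the table; the smallest threshold (one task per row group) reads fastest here, presumably because the 18 tasks spread across the 4 worker nodes.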

Multi-file workflow

Read and aggregate (DataFrame.count()) 32 files in S3 using Ray with 4 and 8 worker nodes

Parquet file info:

  • total rows: 600,037,902
  • number of row groups per file: 18
  • file sizes: 710.7-711.5 MiB
  • number of columns: 16
| Split threshold (MiB) | # of scan tasks per file | Time (4 workers) | Time (8 workers) |
| --- | --- | --- | --- |
| 32 | 18 | 23.83s | 14.17s |
| 64 | 9 | 24.99s | 14.17s |
| 128 | 5 | 26.51s | 15.23s |
| 256 | 3 | 27.96s | 16.58s |
| 512 | 2 | 30.27s | 16.85s |
| 1024 | 1 | 26.50s | 29.06s |

(Averaged over 5 runs)
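For completeness, a rough sketch of how a user might exercise this from Python is below. The execution-config parameter name shown (`scan_tasks_max_size_bytes`) is an assumption for illustration and may not match the option this PR actually exposed; the S3 path is hypothetical.

```python
import daft

# Sketch only: the parameter name below is assumed for illustration and may not
# match the execution-config option added by this PR.
daft.set_execution_config(scan_tasks_max_size_bytes=64 * 1024 * 1024)  # ~64 MiB split threshold

# Hypothetical S3 path standing in for the benchmarked files.
df = daft.read_parquet("s3://example-bucket/large-file.parquet")
print(df.count_rows())
```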

@kevinzwang linked an issue on Jan 19, 2024 that may be closed by this pull request (3 tasks)
@kevinzwang marked this pull request as ready for review on January 19, 2024 08:34
vec![DataFileSource::AnonymousDataFile {
    path: path.to_string(),
    chunk_spec: Some(ChunkSpec::Parquet(vec![rg as i64])),
    size_bytes: Some(rgm.compressed_size() as u64),
@kevinzwang (Member, Author) commented on this diff:

I interpreted size_bytes as representing the file size, which stores the data compressed, so I'm pretty sure .compressed_size() is the right method here; but if it's meant to be the uncompressed size of the data, then we should use .total_byte_size() instead.
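For readers less familiar with Parquet metadata, the same two quantities can be inspected from Python with pyarrow. This is only an illustration of the compressed-vs-uncompressed distinction, not code from this PR; the file path is hypothetical.

```python
import pyarrow.parquet as pq

# Hypothetical file path, for illustration only.
md = pq.ParquetFile("example.parquet").metadata

for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # Uncompressed size of the row group's data.
    uncompressed = rg.total_byte_size
    # On-disk size: sum of the compressed sizes of the row group's column chunks.
    compressed = sum(rg.column(c).total_compressed_size for c in range(rg.num_columns))
    print(f"row group {i}: compressed={compressed:,} B, uncompressed={uncompressed:,} B")
```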

@kevinzwang (Member, Author) commented:

The function materialize_scan_task doesn't seem to make use of the row groups when using the Python storage config. Not sure if that's something to worry about.

See: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-micropartition/src/micropartition.rs#L221


codecov bot commented Jan 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (f471738) 85.47% vs. head (4acd096) 85.47%.

Additional details and impacted files


@@           Coverage Diff           @@
##             main    #1799   +/-   ##
=======================================
  Coverage   85.47%   85.47%           
=======================================
  Files          55       55           
  Lines        6119     6119           
=======================================
  Hits         5230     5230           
  Misses        889      889           
| Files | Coverage Δ |
| --- | --- |
| daft/context.py | 72.04% <ø> (ø) |

@kevinzwang (Member, Author) commented:

> The function materialize_scan_task doesn't seem to make use of the row groups when using the Python storage config. Not sure if that's something to worry about.
>
> See: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-micropartition/src/micropartition.rs#L221

Conclusion of a conversation with @samster25: don't do row-group splitting when using the Python reader, since it is a legacy feature whose performance we don't care about.

@kevinzwang (Member, Author) commented:

Also let me know if I should write tests for non-local reads.

(Resolved review threads on src/common/daft-config/src/lib.rs and src/daft-scan/src/scan_task_iters.rs)
import daft

FILES = [
    "tests/assets/parquet-data/mvp.parquet",
A reviewer (Member) commented on this test:

We should also add some of the S3 file sources here too. I believe we have all these files in S3 as well.

@kevinzwang (Member, Author) replied:

I just removed this test and added a fixture to test_reads_public_data.py
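As a rough idea of what such a fixture could look like, here is a sketch under assumptions; it is not the actual contents of test_reads_public_data.py, and the S3 URL is hypothetical.

```python
import pytest

# The local path comes from the removed test above; the S3 URL is a hypothetical
# mirror, standing in for the project's real public S3 fixtures.
PARQUET_SOURCES = [
    "tests/assets/parquet-data/mvp.parquet",
    "s3://daft-public-data/test_fixtures/parquet-data/mvp.parquet",
]

@pytest.fixture(params=PARQUET_SOURCES)
def parquet_path(request):
    """Run each read test against both the local and the S3 copy of the file."""
    return request.param
```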

(Resolved review thread on src/common/daft-config/src/lib.rs)
@kevinzwang requested a review from samster25 on January 26, 2024 18:29
(Resolved review threads on daft/context.py and src/common/daft-config/src/lib.rs)
@clarkzinzow (Contributor) left a comment:

LGTM overall, just have a few nits and questions about the interaction between splitting and merging scan tasks.

(Review threads on src/daft-scan/src/scan_task_iters.rs, src/common/daft-config/src/lib.rs, tests/integration/io/parquet/test_reads_public_data.py, and src/daft-plan/src/planner.rs, all resolved)
@kevinzwang merged commit 8aba872 into main on Feb 8, 2024
42 checks passed
@kevinzwang deleted the kevin/split-parquet-row-groups branch on February 8, 2024 01:14
Development

Successfully merging this pull request may close these issues.

[PERF] Read large files into smaller partitions
3 participants