[FEAT] Include file paths as column from read_parquet/csv/json #2953

colin-ho · 2024-09-26T22:19:12Z

Addresses: #2808

This PR adds a file_path_column: str | None parameter to file reads, so that the path of each source file can be included as a column. It works by appending a column containing the file path to the Table after the read and pushdowns have been applied.

Making the parameter a string makes it easy to guarantee unique field names: if the user specifies a column name that already exists, an error is thrown.
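
To illustrate the behavior described above, here is a minimal Python sketch. It is not the PR's implementation (which lives in the Rust crates): the helper name append_file_path_column and the dict-of-lists table representation are assumptions made purely for illustration.

```python
from typing import Dict, List, Optional


def append_file_path_column(
    table: Dict[str, List],
    path: str,
    file_path_column: Optional[str] = None,
) -> Dict[str, List]:
    """Hypothetical helper: return a copy of `table` with `path` repeated in a new column.

    `table` is assumed to be an already-read table (after pushdowns),
    modelled here as a mapping from column name to a list of values.
    """
    if file_path_column is None:
        # Feature disabled: return the table unchanged (as a shallow copy).
        return dict(table)
    if file_path_column in table:
        # Name collision: refuse rather than silently overwrite user data.
        raise ValueError(
            f"Column {file_path_column!r} already exists; "
            "pick a different file_path_column name"
        )
    # All columns are assumed to have equal length; an empty table has 0 rows.
    num_rows = len(next(iter(table.values()), []))
    return {**table, file_path_column: [path] * num_rows}
```

Because the caller chooses the name, a collision can be reported immediately instead of being resolved by guessing an alternative name.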

daft/io/_parquet.py

codspeed-hq · 2024-09-26T22:33:18Z

CodSpeed Performance Report

Merging #2953 will not alter performance

Comparing colin/include-path-in-read (fd67611) with main (ab1b772)

Summary

✅ 17 untouched benchmarks

codecov · 2024-09-26T22:44:03Z

Codecov Report

Attention: Patch coverage is 84.90566% with 24 lines in your changes missing coverage. Please review.

Project coverage is 78.50%. Comparing base (ab1b772) to head (fd67611).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/daft-plan/src/builder.rs	69.44%	11 Missing ⚠️
src/daft-scan/src/lib.rs	70.58%	10 Missing ⚠️
src/daft-micropartition/src/python.rs	50.00%	1 Missing ⚠️
src/daft-scan/src/glob.rs	98.52%	1 Missing ⚠️
src/daft-scan/src/python.rs	87.50%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2953      +/-   ##
==========================================
+ Coverage   78.47%   78.50%   +0.02%     
==========================================
  Files         610      610              
  Lines       71746    71865     +119     
==========================================
+ Hits        56303    56415     +112     
- Misses      15443    15450       +7

Files with missing lines	Coverage Δ
daft/io/_csv.py	`95.65% <ø> (ø)`
daft/io/_json.py	`91.30% <ø> (ø)`
daft/io/_parquet.py	`86.20% <ø> (ø)`
daft/io/common.py	`85.00% <ø> (ø)`
src/daft-csv/src/read.rs	`99.33% <ø> (ø)`
src/daft-micropartition/src/micropartition.rs	`90.83% <100.00%> (+0.02%)`	⬆️
src/daft-micropartition/src/ops/cast_to_schema.rs	`100.00% <100.00%> (ø)`
src/daft-scan/src/anonymous.rs	`76.92% <100.00%> (+1.24%)`	⬆️
src/daft-scan/src/scan_task_iters.rs	`96.95% <100.00%> (+0.01%)`	⬆️
src/daft-sql/src/table_provider/read_parquet.rs	`96.42% <100.00%> (+0.13%)`	⬆️
... and 5 more

... and 8 files with indirect coverage changes

kindkindk · 2024-09-28T17:32:05Z

This is a good feature. To read a file, the user would write df.read_parquet("...", filename="True"); note the string "True", the reason for which is explained in the top comment.

Can we instead make fileName a boolean in the Python API and have Rust treat it as a string?

colin-ho · 2024-10-03T20:18:11Z

> This is a good feature. To read a file, the user would write df.read_parquet("...", filename="True"); note the string "True", the reason for which is explained in the top comment.
>
> Can we instead make fileName a boolean in the Python API and have Rust treat it as a string?

Thanks for the comment, and sorry for the late reply!

We could have it as a boolean in the Python API as you mentioned, but I don't quite understand what you mean by having Rust treat it as a string. If it were a boolean, then on the Rust side we would simply create a column called "paths" for the file paths. The issue arises when the data already has a column called "paths".

In that case I can think of two solutions:

1. Retry with different variations such as "_paths", "file_paths", etc.
2. Expose two arguments (include_file_path: bool = False and file_path_column_name: str | None = None). We would use the provided file_path_column_name, or default to "paths" if none is given; if a column with that name already exists, we would throw an error and tell the user to pick a different name. This solution is essentially a more explicit variant of the current implementation, which has just the single file_path_column_name: str | None = None argument.
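
The two options above can be sketched in Python as follows. This is only an illustration of the trade-off, not code from the PR: the function names, the candidate list, and the choice to ignore file_path_column_name when include_file_path is False are all assumptions.

```python
from typing import Iterable, Optional, Sequence

DEFAULT_FILE_PATH_COLUMN = "paths"  # default name mentioned in the discussion


def pick_fallback_name(
    existing_columns: Iterable[str],
    candidates: Sequence[str] = ("paths", "_paths", "file_paths"),
) -> str:
    """Option 1 (hypothetical): try a fixed list of names until one is unused."""
    taken = set(existing_columns)
    for name in candidates:
        if name not in taken:
            return name
    raise ValueError("All candidate file path column names are already in use")


def resolve_explicit_name(
    existing_columns: Iterable[str],
    include_file_path: bool = False,
    file_path_column_name: Optional[str] = None,
) -> Optional[str]:
    """Option 2 (hypothetical): explicit flag plus optional name, error on collision."""
    if not include_file_path:
        # Assumption: a provided name is ignored when the flag is off.
        return None
    name = file_path_column_name or DEFAULT_FILE_PATH_COLUMN
    if name in set(existing_columns):
        raise ValueError(
            f"Column {name!r} already exists; pass a different file_path_column_name"
        )
    return name
```

Option 1 never fails for the listed candidates but makes the resulting column name depend on the data, while option 2 keeps the name predictable at the cost of an extra argument.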

cc @jaychia for thoughts

desmondcheongzx

Two comments but overall looks good!

desmondcheongzx · 2024-10-11T21:17:31Z

src/daft-plan/src/builder.rs

@@ -658,6 +683,7 @@ impl ParquetScanBuilder {
            ))),
            self.infer_schema,
            self.schema,
+            None,


I think here we should either add a TODO or add file_path_column to the ParquetScanBuilder, then add the argument to the SQL side as well in read_parquet.rs.

src/daft-scan/src/lib.rs


include path in read

6d7a87d

github-actions bot added the enhancement New feature or request label Sep 26, 2024

universalmind303 reviewed Sep 26, 2024

daft/io/_parquet.py (outdated, resolved)

test if column name exists

3321272

Colin Ho and others added 3 commits September 26, 2024 15:48

fix typo and reduce 1 alloc

355c05d

Merge branch main into colin/include-path-in-read

c889112

oops

945cfc7

colin-ho requested review from desmondcheongzx and raunakab October 11, 2024 16:15

Colin Ho added 8 commits October 11, 2024 10:10

include in partition spec

d133fed

cleanup

2c59e26

cleanup

ebe6be7

Merge branch main into colin/include-path-in-read

27b517a

add partitioning key

3e222da

partition pruning

115d342

add pushdown tests

d73f6bb

cleanup

b5995d1

desmondcheongzx approved these changes Oct 11, 2024


colin-ho and others added 3 commits October 11, 2024 14:32

Update src/daft-scan/src/lib.rs

6c65e63

Co-authored-by: Desmond Cheong <[email protected]>

read sql parquet

67ce742

fix error

fd67611

colin-ho merged commit c694c9e into main Oct 11, 2024
41 checks passed

colin-ho deleted the colin/include-path-in-read branch October 11, 2024 22:13
