[FEAT] Include file paths as column from read_parquet/csv/json #2953
CodSpeed Performance Report
Merging #2953 will not alter performance.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2953 +/- ##
==========================================
+ Coverage 78.47% 78.50% +0.02%
==========================================
Files 610 610
Lines 71746 71865 +119
==========================================
+ Hits 56303 56415 +112
- Misses 15443 15450 +7
This is a good feature. To read file user will add Can we instead have fileName as a boolean in the Python API and then Rust can treat it as a string?
Thanks for the comment, and sorry for the late reply. We could have it as a boolean in the Python API as you mentioned, but I don't quite understand what you mean by having Rust treat it as a string. If it were a boolean, then on the Rust side we would simply create a column called "paths" for the file paths. The issue arises when the data already has a column called "paths", in which case I can think of two solutions:
cc @jaychia for thoughts
Two comments but overall looks good!
src/daft-plan/src/builder.rs
@@ -658,6 +683,7 @@ impl ParquetScanBuilder {
))),
self.infer_schema,
self.schema,
None,
I think here we should either add a TODO or add `file_path_column` to the `ParquetScanBuilder`, then add the argument to the SQL side as well in `read_parquet.rs`.
Co-authored-by: Desmond Cheong <[email protected]>
Addresses: #2808
This PR enables adding the file path as a column from file reads via the `file_path_column: str | None` parameter. This works by appending a column of the file path to the Table post read + pushdowns.

Having it as a string makes it easy to have unique field name guarantees, i.e. if the user specifies a column name that already exists then an error is thrown.
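
To illustrate the behaviour described above, here is a minimal sketch in plain Python. It is not the code from this PR: the table is modelled as a dict of column lists, and `append_file_path_column` is a hypothetical helper name chosen for this example. It only shows the idea of appending the path after the read and rejecting a name that already exists.

```python
from typing import Optional


def append_file_path_column(
    table: dict,
    path: str,
    file_path_column: Optional[str] = None,
) -> dict:
    """Return a copy of `table` with an extra column holding `path` on every row.

    `table` maps column names to equal-length lists (a stand-in for a real Table).
    This helper is hypothetical and only illustrates the PR description.
    """
    # No column name requested: leave the data unchanged.
    if file_path_column is None:
        return dict(table)

    # A user-chosen name makes the uniqueness check straightforward.
    if file_path_column in table:
        raise ValueError(f"column {file_path_column!r} already exists in the data")

    # Row count taken from any existing column (0 for an empty table).
    num_rows = len(next(iter(table.values()), []))

    result = dict(table)
    result[file_path_column] = [path] * num_rows
    return result


# The data already has a "paths" column, so the caller picks a different name.
table = {"id": [1, 2, 3], "paths": ["a", "b", "c"]}
with_path = append_file_path_column(table, "s3://bucket/part-0.parquet", "file_path")
```

Because the caller supplies the name, a clash with an existing column such as "paths" surfaces as an explicit error instead of silently overwriting or renaming data.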