
[FEAT][1/2] Support Iceberg renaming of columns #1937

Closed · 25 commits

Conversation
Conversation

@jaychia (Contributor) commented Feb 21, 2024

Summary

Support field_id-based renaming of Parquet columns along the following codepath:

  1. IcebergScanOperator generates ScanTasks, each containing a field_id_mapping: Arc<{i32: Field}>
  2. The mapping is propagated to workers through the ScanWithTask instruction object
  3. MicroPartitions are created with MicroPartition::from_scan_task
  4. This then calls into read_parquet_into_micropartition:
    a. If statistics are available, it creates an unloaded MicroPartition by building a new ScanTask (hydrated with statistics) and then calling MicroPartition::new_unloaded(new_scan_task).
    b. Otherwise, it falls back to read_parquet_bulk, which has been modified to correctly handle field_id_mapping.

This PR ensures that when data/statistics are read from Parquet files, we correctly apply renaming according to field_id_mapping.
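As a rough sketch of what this renaming looks like at read time — the types and names below are simplified stand-ins, not Daft's actual definitions:

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// Stand-in for Daft's Field; the real type also carries a dtype.
#[derive(Clone, Debug, PartialEq)]
struct Field {
    name: String,
}

/// Rename fields read from a Parquet file according to an Iceberg
/// field_id mapping: a field whose id appears in the mapping takes the
/// mapped name, and unmapped fields keep their Parquet name.
fn apply_field_id_mapping(
    parquet_fields: Vec<(i32, Field)>, // (field_id from Parquet metadata, field)
    field_id_mapping: &Arc<BTreeMap<i32, Field>>,
) -> Vec<Field> {
    parquet_fields
        .into_iter()
        .map(|(field_id, mut field)| {
            if let Some(mapped) = field_id_mapping.get(&field_id) {
                field.name = mapped.name.clone();
            }
            field
        })
        .collect()
}

fn main() {
    let mapping = Arc::new(BTreeMap::from([(1, Field { name: "renamed_col".into() })]));
    let fields = vec![(1, Field { name: "old_col".into() }), (2, Field { name: "other".into() })];
    let renamed = apply_field_id_mapping(fields, &mapping);
    assert_eq!(renamed[0].name, "renamed_col");
    assert_eq!(renamed[1].name, "other"); // no mapping entry: name unchanged
}
```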

Reviewer Notes

A lot of the errors caught/triggered by this PR have to do with mismatches between the fields (names/metadata) on our schemas and on our Series objects.

Keeping those two in sync is fairly challenging with the way our code is currently structured.

The approach taken to try and fix this is:

  1. Try to use the same logic for field_id renaming across Series and Schemas
  2. When reading data from Parquet -> arrow2 -> Daft Series/Schema, perform a post-processing step to remove any field metadata that was retrieved from the Parquet files.

However, I do think that this is a fairly error-prone situation. I'm not sure what the best approach is, though.
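To make step 2 concrete, here is a minimal sketch of that post-processing step over a stand-in field type (the real code operates on arrow2 fields, and the metadata key shown is only an assumed example):

```rust
use std::collections::BTreeMap;

// Stand-in for an arrow2-style field carrying a metadata map.
#[derive(Clone, Debug, PartialEq)]
struct ArrowField {
    name: String,
    metadata: BTreeMap<String, String>,
}

/// Post-processing after Parquet -> arrow2 -> Daft: drop any field metadata
/// picked up from the Parquet file so that fields on the Schema and on
/// Series compare equal again.
fn strip_parquet_metadata(mut field: ArrowField) -> ArrowField {
    field.metadata.clear();
    field
}

fn main() {
    let field = ArrowField {
        name: "col".into(),
        // "PARQUET:field_id" is the conventional Arrow key; assumed here.
        metadata: BTreeMap::from([("PARQUET:field_id".into(), "1".into())]),
    };
    assert!(strip_parquet_metadata(field).metadata.is_empty());
}
```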

Drive-By

Refactors to clean-up MicroPartitions/ScanTasks and schemas:

  1. Refactored MicroPartition::new_unloaded: it no longer accepts a schema argument; instead internally it will just use the ScanTask's .materialized_schema()
  2. Refactored read_parquet_into_micropartition to significantly reduce code duplication

Remaining todos:

  • Fix logic with column pruning (need to apply column pruning after applying the field ID mappings; see the sketch after this list)
  • Perform correct renaming for statistics parsing from Parquet metadata
  • Perform recursive renaming for Series and for Schema
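For the first todo, a toy sketch of why that ordering matters: pushed-down column names come from the query plan, i.e. they are post-rename names, so pruning only matches if the field ID mapping has been applied first (all names below are made up):

```rust
/// Prune columns by the pushed-down (user-facing, post-rename) names.
fn prune_columns(names: Vec<String>, pushdown: &[&str]) -> Vec<String> {
    names.into_iter().filter(|n| pushdown.contains(&n.as_str())).collect()
}

fn main() {
    // The Parquet file stores the column as "old_col"; Iceberg renamed it
    // to "new_col", and the plan prunes by the renamed name.
    let before_rename = vec!["old_col".to_string(), "other".to_string()];
    let after_rename = vec!["new_col".to_string(), "other".to_string()];

    // Pruning before the rename silently drops the column...
    assert!(prune_columns(before_rename, &["new_col"]).is_empty());
    // ...while pruning after the rename keeps it.
    assert_eq!(prune_columns(after_rename, &["new_col"]), vec!["new_col"]);
}
```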

@github-actions bot added the enhancement (New feature or request) label on Feb 21, 2024

codecov bot commented Feb 21, 2024

Codecov Report

Attention: Patch coverage is 0%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 84.65%. Comparing base (3e0e334) to head (715a9e6).
Report is 1 commit behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1937      +/-   ##
==========================================
- Coverage   84.68%   84.65%   -0.03%     
==========================================
  Files          57       57              
  Lines        6293     6295       +2     
==========================================
  Hits         5329     5329              
- Misses        964      966       +2     
Files                          Coverage Δ
daft/iceberg/iceberg_scan.py   0.00% <0.00%> (ø)

@jaychia force-pushed the jay/fd-add-rename-test branch from 598ab84 to 56562e5 on February 21, 2024 at 19:57
src/daft-table/src/lib.rs (outdated review thread, resolved)

Cargo.toml (outdated)
@@ -112,10 +112,11 @@ tokio-util = "0.7.8"
url = "2.4.0"

[workspace.dependencies.arrow2]
# branch = "daft-fork"
# TODO: Update this to daft-fork
# branch = "jay/fd-add-rename-test"
@jaychia (Contributor, Author) commented:
Change was added here to populate field_ids in the inferred arrow2 Field's metadata
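For reference, a sketch of the effect of that fork change, written against arrow2's Field API; the exact metadata key the fork writes is an assumption here ("PARQUET:field_id" is the Arrow convention):

```rust
use std::collections::BTreeMap;

use arrow2::datatypes::{DataType, Field, Metadata};

fn main() {
    // After the fork change, schema inference attaches the Parquet field_id
    // to the inferred Field's metadata, roughly like this:
    let metadata: Metadata = BTreeMap::from([("PARQUET:field_id".to_string(), "2".to_string())]);
    let field = Field::new("col", DataType::Int64, true).with_metadata(metadata);
    assert_eq!(
        field.metadata.get("PARQUET:field_id").map(String::as_str),
        Some("2")
    );
}
```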

@jaychia (Contributor, Author) commented:
Once this PR is approved, I'll update this to point to a rebased daft-fork again

@jaychia force-pushed the jay/fd-add-rename-test branch from 615dd6c to 412ffbc on February 24, 2024 at 08:09
@@ -129,3 +129,16 @@ impl Display for Field {
write!(f, "{}#{}", self.name, self.dtype)
}
}

impl PartialEq for Field {
@jaychia (Contributor, Author) commented:
I implemented a custom PartialEq and Hash for Field because we were getting a bunch of issues with Schema::eq in both Rust and Python, now that our arrow2 reader is propagating field_id into the metadata field.

Not sure if this is the best idea though. Reviewers should feel free to comment!
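The shape of the change, sketched against a stand-in type (the real Field presumably also compares dtypes; the point here is only that metadata is excluded from equality and hashing):

```rust
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

// Stand-in for Daft's Field; the real type also carries a dtype.
#[derive(Clone, Debug)]
struct Field {
    name: String,
    metadata: BTreeMap<String, String>,
}

// Equality and hashing ignore metadata, so a field that picked up a
// field_id from Parquet still equals its metadata-free counterpart.
impl PartialEq for Field {
    fn eq(&self, other: &Self) -> bool {
        self.name == other.name
    }
}

impl Eq for Field {}

impl Hash for Field {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.name.hash(state);
    }
}

fn main() {
    let plain = Field { name: "col".into(), metadata: BTreeMap::new() };
    let tagged = Field {
        name: "col".into(),
        metadata: BTreeMap::from([("PARQUET:field_id".into(), "1".into())]),
    };
    assert_eq!(plain, tagged); // metadata no longer breaks Schema equality
}
```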

@jaychia changed the title from "[FEAT] Support Iceberg renaming of columns" to "[FEAT][1/2] Support Iceberg renaming of columns" on Feb 27, 2024
@jaychia force-pushed the jay/fd-add-rename-test branch twice, most recently from 71e5471 to 079ef11 on March 4, 2024 at 19:41
@jaychia force-pushed the jay/fd-add-rename-test branch from 079ef11 to 570764a on March 4, 2024 at 20:02
Adds support for renaming of nested columns (columns renamed under
structs and lists)

**Reviewers to note: this is a follow-on PR to #1937**

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
@@ -693,6 +708,68 @@ pub(crate) fn read_json_into_micropartition(
}
}

#[allow(clippy::too_many_arguments)]
fn _read_parquet_into_loaded_micropartition(
@jaychia (Contributor, Author) commented:
This is for code deduplication in read_parquet_into_micropartition

self.metadata.clone(),
pruned_statistics.expect("Unloaded MicroPartition should have statistics"),
)),
TableState::Unloaded(scan_task) => {
@jaychia (Contributor, Author) commented:
An Unloaded MicroPartition's schema is actually just defined by its ScanTask's schema, so we should be replacing the ScanTask's schema directly.
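A toy illustration of that point, with simplified stand-ins for the daft-micropartition / daft-scan types:

```rust
#[derive(Clone, Debug, PartialEq)]
struct Schema(Vec<String>); // just column names, for illustration

#[derive(Clone, Debug)]
struct ScanTask {
    path: String,
    schema: Schema,
}

enum TableState {
    // Nothing has been read yet: the schema lives on the ScanTask.
    Unloaded(ScanTask),
}

/// Renaming an unloaded micropartition rewrites the schema on its ScanTask;
/// there are no materialized tables to touch.
fn rename_unloaded(state: TableState, new_schema: Schema) -> TableState {
    match state {
        TableState::Unloaded(task) => TableState::Unloaded(ScanTask {
            schema: new_schema,
            ..task
        }),
    }
}

fn main() {
    let state = TableState::Unloaded(ScanTask {
        path: "s3://bucket/file.parquet".into(),
        schema: Schema(vec!["old_col".into()]),
    });
    let TableState::Unloaded(task) = rename_unloaded(state, Schema(vec!["new_col".into()]));
    assert_eq!(task.schema, Schema(vec!["new_col".into()]));
}
```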

@@ -95,6 +102,214 @@ where
}
}

fn resolve_dtype_recursively(
@jaychia (Contributor, Author) commented:
The resolve_*_recursively code is pretty messy. Happy to take on work to convert it into a Visitor pattern if reviewers think it's necessary.
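For context, the rough shape of that recursive resolution over a toy dtype (Daft's real DataType has many more variants; this only shows why nesting forces recursion, and why a Visitor would centralize the traversal):

```rust
// Toy stand-in for Daft's DataType, reduced to the nesting cases.
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Int64,
    List(Box<DataType>),
    Struct(Vec<(String, DataType)>), // (field name, child dtype)
}

/// Recursively resolve a dtype, renaming struct fields with a caller-supplied
/// function; lists recurse into their child type, leaves pass through.
fn resolve_dtype_recursively(dtype: DataType, rename: &impl Fn(&str) -> String) -> DataType {
    match dtype {
        DataType::Int64 => DataType::Int64,
        DataType::List(child) => {
            DataType::List(Box::new(resolve_dtype_recursively(*child, rename)))
        }
        DataType::Struct(fields) => DataType::Struct(
            fields
                .into_iter()
                .map(|(name, child)| (rename(&name), resolve_dtype_recursively(child, rename)))
                .collect(),
        ),
    }
}

fn main() {
    let dtype = DataType::List(Box::new(DataType::Struct(vec![(
        "old_col".to_string(),
        DataType::Int64,
    )])));
    let renamed = resolve_dtype_recursively(dtype, &|_| "new_col".to_string());
    let expected =
        DataType::List(Box::new(DataType::Struct(vec![("new_col".to_string(), DataType::Int64)])));
    assert_eq!(renamed, expected);
}
```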

@jaychia (Contributor, Author) commented Mar 8, 2024:

Closing in favor of a better approach in #1990

@jaychia closed this on Mar 8, 2024