[CHORE] Simplify cast to schema (#1982)
This PR changes how we use `cast_to_schema` in our codebase and makes
some subtle fixes that will make our code easier to refactor later on
for field IDs.

## Reducing user error with schemas

I added documentation to our structs detailing the correct semantics of
each `schema` field on ScanTask and MicroPartition.

Additionally:

1. `MicroPartition::new_unloaded` no longer takes a schema as input.
Instead, the unloaded MicroPartition's schema is simply its ScanTask's
`materialized_schema`.
2. `materialize_scan_task` no longer takes a `cast_to_schema` argument.
Instead, it uses the ScanTask's `materialized_schema`, with
`partition_spec()` supplying the fill map, to correctly coerce all the
materialized Tables.
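
To illustrate point 1, here is a minimal sketch of deriving an unloaded MicroPartition's schema from its ScanTask rather than accepting it as an argument. The `Field`, `Schema`, `ScanTask` and `MicroPartition` types below are simplified, hypothetical stand-ins and not Daft's real definitions. In particular, the `columns` field and the rule that `materialized_schema` is the schema restricted to the selected columns are assumptions made for the example.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    dtype: String,
}

type Schema = Vec<Field>;

struct ScanTask {
    // Schema of the data source as a whole.
    schema: Arc<Schema>,
    // Assumed column selection; `None` means "all columns".
    columns: Option<Vec<String>>,
}

impl ScanTask {
    // Assumption: the materialized schema is the schema restricted
    // to the selected columns, in schema order.
    fn materialized_schema(&self) -> Arc<Schema> {
        match &self.columns {
            None => Arc::clone(&self.schema),
            Some(cols) => {
                let selected: Schema = self
                    .schema
                    .iter()
                    .filter(|f| cols.contains(&f.name))
                    .cloned()
                    .collect();
                Arc::new(selected)
            }
        }
    }
}

struct MicroPartition {
    schema: Arc<Schema>,
    scan_task: Arc<ScanTask>,
}

impl MicroPartition {
    // No schema parameter: the caller cannot pass one that
    // disagrees with the ScanTask.
    fn new_unloaded(scan_task: Arc<ScanTask>) -> Self {
        let schema = scan_task.materialized_schema();
        MicroPartition { schema, scan_task }
    }
}
```

With a single source of truth for the schema, the unloaded MicroPartition and the Tables later materialized from its ScanTask cannot drift apart.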

## Refactoring the `read_parquet_into_micropartition` megafunction

**Share code**: Move shared code into a
`_read_parquet_into_loaded_micropartition` helper, which is called in 2
locations from `read_parquet_into_micropartition`.

**Fix the Unloaded case**: Previously, in the unloaded micropartition
case, we were creating a `ScanTask` using the schema inferred from the
Parquet file. This is *wrong* behavior! I added a new
`catalog_provided_schema` argument that is now correctly used when
creating MicroPartitions in this function, in both the loaded and
unloaded cases.

**Casting**: We now perform casting inside
`read_parquet_into_micropartition`:

1. When we eagerly create loaded micropartitions, we call
`cast_to_schema_with_fill` on the materialized tables.
2. When we create unloaded micropartitions, we create the ScanTask with
the right schema, partition_spec and column selection so that later on,
when we materialize the MicroPartition, we correctly call
`cast_to_schema_with_fill` on each Table. We also call
`cast_to_schema_with_fill` on the stats.
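
As a rough illustration of what casting with a fill map means, the sketch below coerces a toy table to a target schema. It is not Daft's implementation. The `Value`, `DType` and `Table` types, the function signature, and the null-fill fallback are all assumptions made for the example. Only the name `cast_to_schema_with_fill` comes from this PR.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Null,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DType {
    Int,
    Str,
}

// A toy table: column name -> column values.
type Table = HashMap<String, Vec<Value>>;

// Cast a single value; values that cannot be parsed become Null.
fn cast_value(v: &Value, to: DType) -> Value {
    match (v, to) {
        (Value::Int(i), DType::Str) => Value::Str(i.to_string()),
        (Value::Str(s), DType::Int) => {
            s.parse::<i64>().map(Value::Int).unwrap_or(Value::Null)
        }
        (other, _) => other.clone(),
    }
}

// Coerce `table` to `schema`:
// - columns present in the table are cast to the target dtype,
// - columns missing from the table are filled from `fill_map`
//   (for example, partition values), falling back to nulls,
// - columns not named in `schema` are dropped.
fn cast_to_schema_with_fill(
    table: &Table,
    num_rows: usize,
    schema: &[(&str, DType)],
    fill_map: &HashMap<String, Value>,
) -> Table {
    schema
        .iter()
        .map(|(name, dtype)| {
            let column: Vec<Value> = match table.get(*name) {
                Some(values) => {
                    values.iter().map(|v| cast_value(v, *dtype)).collect()
                }
                None => {
                    let fill = fill_map.get(*name).cloned().unwrap_or(Value::Null);
                    vec![fill; num_rows]
                }
            };
            (name.to_string(), column)
        })
        .collect()
}
```

Under these assumptions, a file that stores `id` as strings and has no `date` column can still be coerced to a schema of `id: Int` and `date: Str`: `id` is cast, and `date` is filled from the fill map.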

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
jaychia and Jay Chia authored Mar 6, 2024
1 parent 9ed9b3f commit 5af4ee9
Showing 6 changed files with 182 additions and 124 deletions.
1 change: 1 addition & 0 deletions Cargo.lock


1 change: 1 addition & 0 deletions src/daft-micropartition/Cargo.toml
@@ -18,6 +18,7 @@ pyo3 = {workspace = true, optional = true}
pyo3-log = {workspace = true}
serde = {workspace = true}
snafu = {workspace = true}
tokio = {workspace = true}

[features]
default = ["python"]