Commit
[CHORE] Simplify cast to schema (#1982)
This PR changes how we use `cast_to_schema` in our codebase, and includes some subtle fixes that make our code easier to refactor later on for field IDs.

## Reducing user error with schemas

I added documentation to our structs detailing the correct semantics of each `schema` field on ScanTask and MicroPartition. Additionally:

1. `MicroPartition::new_unloaded` no longer takes a schema as input. Instead, the unloaded MicroPartition's schema is simply its ScanTask's `materialized_schema`.
2. `materialize_scan_task` no longer takes a `cast_to_schema` argument. Instead, it uses the ScanTask's `materialized_schema`, with `partition_spec()` as the fill map, to correctly coerce all the materialized Tables.

## Refactoring the `read_parquet_into_micropartition` megafunction

**Share code**: Move shared code into a `_read_parquet_into_loaded_micropartition` helper, which is called in 2 locations from `read_parquet_into_micropartition`.

**Fix the Unloaded case**: Previously, in the unloaded micropartition case, we were creating a `ScanTask` using the schema inferred from the Parquet file. This is *wrong* behavior! I added a new `catalog_provided_schema` argument that is now correctly used when creating MicroPartitions in this function, in both the loaded and unloaded cases.

**Casting**: We now perform casting inside of `read_parquet_into_micropartition`:

1. When we eagerly create loaded micropartitions, we call `cast_to_schema_with_fill` on the materialized tables.
2. When we create unloaded micropartitions, we create the ScanTask with the right schema, partition_spec and column selection, so that when we later materialize the MicroPartition we correctly call `cast_to_schema_with_fill` on each Table.

We also call `cast_to_schema_with_fill` on the stats.

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
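
For readers unfamiliar with the idea of casting to a schema with a fill map, the following is a minimal sketch of the semantics described above. It is not Daft's implementation: the `DType`, `Value`, and `Table` types and the function signature are simplified, hypothetical stand-ins, and the fallback to nulls for columns missing from both the table and the fill map is an assumption made for illustration.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for illustration only; these are NOT Daft's real types.
#[derive(Clone, Debug, PartialEq)]
pub enum DType {
    Int64,
    Utf8,
}

#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Int(i64),
    Str(String),
    Null,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Table {
    pub num_rows: usize,
    // Ordered list of (column name, column values).
    pub columns: Vec<(String, Vec<Value>)>,
}

// Cast a single value to the target dtype; values that cannot be parsed become Null.
fn cast_value(v: &Value, to: &DType) -> Value {
    match (v, to) {
        (Value::Int(i), DType::Utf8) => Value::Str(i.to_string()),
        (Value::Str(s), DType::Int64) => {
            s.parse::<i64>().map(Value::Int).unwrap_or(Value::Null)
        }
        // Already the right type, or Null: pass through unchanged.
        (other, _) => other.clone(),
    }
}

// Coerce `table` to `schema`:
// - columns present in the table are cast to the schema's dtype,
// - columns missing from the table are filled from `fill_map` (e.g. partition values),
// - columns missing from both are filled with nulls (an assumption of this sketch),
// - columns in the table but not in the schema are dropped.
pub fn cast_to_schema_with_fill(
    table: &Table,
    schema: &[(String, DType)],
    fill_map: &HashMap<String, Value>,
) -> Table {
    let columns = schema
        .iter()
        .map(|(name, dtype)| {
            let values: Vec<Value> = match table.columns.iter().find(|(n, _)| n == name) {
                Some((_, col)) => col.iter().map(|v| cast_value(v, dtype)).collect(),
                None => {
                    let fill = fill_map.get(name).cloned().unwrap_or(Value::Null);
                    vec![fill; table.num_rows]
                }
            };
            (name.clone(), values)
        })
        .collect();
    Table { num_rows: table.num_rows, columns }
}
```

In this sketch the output always has exactly the columns of the target schema, in schema order, which is why the choice of schema passed in (catalog-provided versus inferred from the file) determines the shape of the result.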