[BUG] Fix parquet reads with limit across row groups (#2751)
When reading local parquet files containing multiple row groups with a limit applied, the resulting table sometimes does not respect the given limit, causing errors such as:

```
DaftError::ValueError While building a Table with Table::new_with_size, we found that the Series lengths did not match. Series named: col had length: 2048 vs the specified Table length: 1034
```

The issue was a small bug where each row group range being read would take the global limit passed into the parquet read, instead of the pre-computed row group limit, which accounts for how many rows had already been read by previous row groups. This caused the parquet reader to read more rows from a row group than specified. To fix this, we pass the pre-computed row group limit properly to the reader.

For example, consider a parquet file with the following layout:

```
Column: col
--------------------------------------------------------------------------------
  page   type  enc  count  avg size  size   rows  nulls  min / max
  0-D    dict  S _  1      5.00 B    5 B
  0-1    data  S R  1024   0.01 B    11 B          0      "b" / "b"
  1-D    dict  S _  1      5.00 B    5 B
  1-1    data  S R  1024   0.01 B    11 B          0      "b" / "b"
  2-D    dict  S _  1      5.00 B    5 B
  2-1    data  S R  1024   0.01 B    11 B          0      "b" / "b"
  3-D    dict  S _  1      5.00 B    5 B
  3-1    data  S R  1024   0.01 B    11 B          0      "b" / "b"
```

When applying a `.limit(1050)` over this parquet file, the bug caused us to read all 1024 rows from each of row groups 0 and 1 (data pages `0-1` and `1-1`), for 2048 rows total. Row groups 2 and 3 are skipped because the pre-computed row ranges see that the first two row groups already cover the required 1050 rows. However, the pre-computed row ranges are also aware that we only need 26 rows (1050 - 1024) from row group 1, so we simply pass this information correctly into the reader.
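For illustration, here is a minimal Rust sketch of the pre-computed per-row-group limits described above. This is not the actual Daft reader code; the function name `row_group_limits` and its signature are hypothetical, but the arithmetic mirrors the fix: each row group receives only the rows still needed, rather than the global limit.

```rust
/// Hypothetical helper (not Daft's actual API): given the number of rows
/// in each row group and a global limit, compute how many rows to read
/// from each row group. Row groups past the limit get 0 and can be skipped.
fn row_group_limits(row_group_sizes: &[usize], global_limit: usize) -> Vec<usize> {
    let mut remaining = global_limit;
    row_group_sizes
        .iter()
        .map(|&num_rows| {
            // Read at most the rows this group has, capped by what is left
            // of the global limit after previous row groups.
            let to_read = num_rows.min(remaining);
            remaining -= to_read;
            to_read
        })
        .collect()
}

fn main() {
    // Four row groups of 1024 rows each, as in the layout above.
    let limits = row_group_limits(&[1024, 1024, 1024, 1024], 1050);
    // The buggy reader passed the global 1050 to every row group, reading
    // 1024 + 1024 = 2048 rows; the fix passes the per-group limits instead.
    assert_eq!(limits, vec![1024, 26, 0, 0]);
}
```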