(this is very similar in philosophy to #47 and it would be good to read that before this; the same caveats apply)
A job that wants to read only new data since the last time it ran must know what the high-water mark of its previous run was, and read new data from its source based on a predicate. For instance:
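(a minimal sketch of that predicate-based pattern, assuming a Spark job over a hypothetical `db.events` table with an `event_ts` timestamp column; the high-water mark is a value the previous run persisted somewhere)

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalByPredicate {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("incremental-read").getOrCreate();

    // Data-based high-water mark: the largest event_ts the previous run saw.
    String lastHighWaterMark = "2019-01-01 00:00:00";

    // The source still scans everything and filters it down; the predicate is
    // what limits the result to "new" rows.
    Dataset<Row> newData = spark.read()
        .format("iceberg")
        .load("db.events")
        .where(String.format("event_ts > to_timestamp('%s')", lastHighWaterMark));

    newData.show();
    spark.stop();
  }
}
```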
However, we can instead base our reads on the snapshots that build up over time: if our snapshots are S1, S2, S3 and S4, and the last snapshot we processed was S1, we can read the new data from S2, S3 and S4 and skip the filtering completely. This would essentially make our high-water mark metadata-based, rather than data-based.
This can be achieved using the low-level Iceberg API, but not through the Spark API; adding that capability would be a great addition to the project.
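For reference, a rough sketch of the metadata-only walk using the core API (assuming `Snapshot#addedFiles()`; loading the `Table` and persisting the last processed snapshot id are left out):

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class IncrementalBySnapshots {
  // Collect the files added by every snapshot committed after the last one we
  // processed. Only table metadata is touched; no data is read or filtered.
  public static void printNewFiles(Table table, long lastProcessedSnapshotId) {
    boolean pastHighWaterMark = false;
    for (Snapshot snapshot : table.snapshots()) { // ordered oldest to newest
      if (pastHighWaterMark) {
        for (DataFile file : snapshot.addedFiles()) {
          System.out.println("new data file: " + file.path());
        }
      } else if (snapshot.snapshotId() == lastProcessedSnapshotId) {
        pastHighWaterMark = true; // everything after this snapshot is new
      }
    }
  }
}
```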
This can be done fairly easily by adding key-value properties when reading with Spark. We plan to do this to implement AS OF SYSTEM TIME SQL statements as well. You'd pass extra information and, in the IcebergSource, use it to customize the scan.
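From the reader side that could look something like the following; the option keys are assumptions chosen for illustration, not a settled contract:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadWithScanOptions {
  // Reader options flow through to IcebergSource, which can use them to
  // customize the scan it plans, e.g. restricting it to a snapshot range.
  public static Dataset<Row> readAppends(SparkSession spark, long fromSnapshotId, long toSnapshotId) {
    return spark.read()
        .format("iceberg")
        .option("start-snapshot-id", String.valueOf(fromSnapshotId)) // exclusive; hypothetical key
        .option("end-snapshot-id", String.valueOf(toSnapshotId))     // inclusive; hypothetical key
        .load("db.events");
  }
}
```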
Scans don't currently support filtering by what is "new" in a snapshot (or multiple snapshots), but that should be easy to add by extending the TableScan interface and the underlying BaseTableScan implementation.
Here's a sketch of how this API might look.
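One possible shape, assuming hypothetical `appendsBetween` and `useSnapshots` additions to the scan interface (the names and signatures are illustrative guesses):

```java
import java.util.List;
import org.apache.iceberg.TableScan;

public interface IncrementalTableScan extends TableScan {
  // plan only the files appended after fromSnapshotId, up to and including
  // toSnapshotId
  TableScan appendsBetween(long fromSnapshotId, long toSnapshotId);

  // plan only the files added by an explicit list of snapshots
  TableScan useSnapshots(List<Long> snapshotIds);
}
```

Taking explicit snapshot ids keeps the scan purely metadata-driven, and an explicit list is what enables the parallel-processing use case mentioned below.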
Note: Specifying the list of snapshots would also let this API support other use cases, such as parallel processing of snapshots.