Promoting Idempotency Through Metadata #47
Comments
I think we're talking about different use cases that are both idempotent:
I think the right choice depends on the job and how it is written. Our data engineers write jobs all the time that overwrite specific partitions and are idempotent, but we also have a separate system for them to run incremental processing. Also, the point about overwrite being slow doesn't really apply unless what you're trying to delete is mixed with other data you're not trying to delete. This isn't intended to support that case.

I think there is a use case for incremental processing, so I'd like to see support for writes that also revert a previous commit. I'm skeptical about an API that bridges both Spark and Iceberg, though. To do this, I'd just pass the snapshot to revert through Spark's API as a `DataFrameWriter` option to Iceberg. Then let Iceberg handle the revert and commit. I think this requires:
@omervk, you might want to have a look at #52, which adds single-table transactions. That would enable you to revert a commit and replace its data in the same commit. Also, I've updated the Snapshots section of the spec with a set of operations to help identify what changed in a snapshot. It would be great to hear your take on the update and whether
Currently, implementing idempotent jobs over Iceberg is done via data-based predicates. For instance, if a job run is presumed to have written the data for `2018-08-05`, you will write something like:

However, this may be:
To promote more complete idempotency, we can use the metadata Iceberg provides to revert previous snapshots based on their metadata. If, for instance, Partition P1 writes file F1, and we want to re-run the job that wrote it, we can write P2 which deletes F1 and writes F2 with the new data, effectively reverting P1.
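To make the idea concrete, here is a minimal sketch of reverting a commit by its metadata. It is a toy in-memory model, not Iceberg's API: `ToyTable`, `commit`, `live_files`, and `revert_and_replace` are hypothetical names made up for illustration, and each commit simply records which files it added and deleted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    added: frozenset    # files this commit added
    deleted: frozenset  # files this commit removed


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table's snapshot log."""
    snapshots: list = field(default_factory=list)

    def commit(self, added=(), deleted=()):
        # Every commit produces exactly one new snapshot.
        snap = Snapshot(len(self.snapshots) + 1, frozenset(added), frozenset(deleted))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def live_files(self):
        # Replay the log: the live set is whatever was added and not later deleted.
        files = set()
        for snap in self.snapshots:
            files -= snap.deleted
            files |= snap.added
        return files

    def revert_and_replace(self, snapshot_id, new_files):
        # Look up what the old commit added (metadata only, no data predicate),
        # then delete those files and add the new ones in a single commit.
        target = next(s for s in self.snapshots if s.snapshot_id == snapshot_id)
        return self.commit(added=new_files, deleted=target.added)


table = ToyTable()
table.commit(added={"F0"})                 # unrelated earlier data
p1 = table.commit(added={"F1"})            # first run of the job
p2 = table.revert_and_replace(p1, {"F2"})  # re-run: drops F1, adds F2
```

After the re-run, `table.live_files()` contains `F0` and `F2` but not `F1`, and readers never see a state with both `F1` and `F2` or with neither, because the delete and the append land in the same commit.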
The benefits from this would be:
Note: This would only be usable in cases where we are only appending new data in snapshots, so cases where we also regularly compact or coalesce files may not be supported.
To achieve this, we could:

- Use the `com.netflix.iceberg.RewriteFiles` operation, but this would keep us at a very low-level, close to Iceberg, and force us to manually manage the files ourselves.
- Use the `com.netflix.iceberg.Rollback` operation, but this only rolls back the previous snapshot, which is something we don't want to be tied to.
- Use the `com.netflix.iceberg.DeleteFiles` operation, but this would create a new snapshot, causing us to either read duplicate or incomplete data.

What could be great is an API that lets us have some sort of transaction over both high-level (Spark) and low-level (Iceberg) APIs, so that we could delete the files written in a snapshot and write data using Spark, only then committing the transaction and creating a new snapshot.
@rdblue I would love to hear what you think this kind of API would look like.
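As a starting point for discussion, here is a rough sketch of the semantics I have in mind, written as a toy in-memory model. None of these names exist in Iceberg or Spark: `ToyTable`, `ToyTransaction`, `transaction`, `delete_files`, and `append_files` are all hypothetical. The point is only that deletes and writes are staged, and a single snapshot is published when the whole block succeeds.

```python
from contextlib import contextmanager


class ToyTransaction:
    """Hypothetical staging area: changes are invisible until commit."""

    def __init__(self, base):
        self.files = set(base)

    def delete_files(self, files):
        self.files -= set(files)

    def append_files(self, files):
        self.files |= set(files)


class ToyTable:
    """Hypothetical stand-in for a table; each commit yields one snapshot."""

    def __init__(self):
        self.snapshots = [frozenset()]  # snapshot 0 is the empty table

    def current(self):
        return self.snapshots[-1]

    @contextmanager
    def transaction(self):
        txn = ToyTransaction(self.current())
        yield txn
        # Reached only if the body raised nothing: publish one new snapshot.
        self.snapshots.append(frozenset(txn.files))


table = ToyTable()
with table.transaction() as txn:
    txn.append_files({"F1"})  # first run of the job

# Re-run: delete the old output and write the new one in a single commit.
with table.transaction() as txn:
    txn.delete_files({"F1"})
    txn.append_files({"F2"})

# A failed re-run leaves the table untouched.
try:
    with table.transaction() as txn:
        txn.delete_files({"F2"})
        raise RuntimeError("job failed before writing")
except RuntimeError:
    pass
```

In this model the table ends with three snapshots (empty, `F1`, `F2`), and the failed attempt produces none, so there is no intermediate snapshot in which the old data is gone but the new data has not yet arrived.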