-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Clear deleted files set between commit attempts for simple transactions #8520
Core: Clear deleted files set between commit attempts for simple transactions #8520
Conversation
Still need to add test cases |
@rdblue As I was looking into writing tests for this, I think in practice we end up not hitting the scenario described (at least as far as I understand). The reason is currently after the transaction is committed successfully the clean up only cleans up uncommitted files. https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L455 . The same logic exists at the All that said, I think it's still probably a good invariant to have to clear the files to delete on every transaction attempt in case some new logic gets added in the future. |
@@ -421,6 +421,7 @@ private void commitSimpleTransaction() { | |||
.onlyRetryOn(CommitFailedException.class) | |||
.run( | |||
underlyingOps -> { | |||
deletedFiles.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not safe to do this here because this will clear before the first commit. applyUpdates
will not re-commit everything on the first commit because those are already completed. Instead I think this needs to be called within applyUpdates
just before all operations are re-committed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're right, I stepped through the code a bit more, really the applyUpdates is more of a reapplyUpdates
(in case the table state changed). applyUpdates
is a no-op on the first attempt because the individual operations commit has already completed. If the deletedFiles set is cleared before the first commit and that commit succeeds we would never end up cleaning those files based on the cleanup done here https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L450. Although I don't know why our tests wouldn't pick up on this, we have tests which validate we cleanup the files we would expect
9176544
to
ebfe7c5
Compare
ebfe7c5
to
ffed1ad
Compare
@amogh-jahagirdar pointed out that we already check and do not delete files that were committed. In that case, we want the set of delete files to be any file that was ever deleted and we don't need to clear the set. Thanks for digging in and finding the right answer, Amogh! |
For transactions, a
deletedFiles
set is maintained which keeps track of the files to delete as part of the most recent commit. For simple transactions, this set is not being cleared between attempts to commit the transaction. It could be possible that on the first transaction attempt, an operation marks certain files for deletion. When the transaction fails and is retried those files may actually be used, and when the retry succeeds those files are cleaned up when they should not be.The set should preserve the invariant of representing the set of files to clear after a transaction is committed, thus this PR clears the set between transaction commit attempts.