Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Clear deleted files set between commit attempts for simple transactions #8520

Conversation

amogh-jahagirdar
Copy link
Contributor

For transactions, a deletedFiles set is maintained which keeps track of the files to delete as part of the most recent commit. For simple transactions, this set is not being cleared between attempts to commit the transaction. It could be possible that on the first transaction attempt, an operation marks certain files for deletion. When the transaction fails and is retried those files may actually be used, and when the retry succeeds those files are cleaned up when they should not be.

The set should preserve the invariant of representing the set of files to clear after a transaction is committed, thus this PR clears the set between transaction commit attempts.

@github-actions github-actions bot added the core label Sep 7, 2023
@amogh-jahagirdar
Copy link
Contributor Author

Still need to add test cases

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Sep 11, 2023

@rdblue As I was looking into writing tests for this, I think in practice we end up not hitting the scenario described (at least as far as I understand). The reason is currently after the transaction is committed successfully the clean up only cleans up uncommitted files. https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L455 . The same logic exists at the SnapshotProducer level as well. https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L931 . So if a manifest file were to get reused and successfully committed in the transaction, it would be part of the committedFiles and never be deleted. Let me know if you see a scenario where our current logic wouldn't be sufficient.

All that said, I think it's still probably a good invariant to have to clear the files to delete on every transaction attempt in case some new logic gets added in the future.

@@ -421,6 +421,7 @@ private void commitSimpleTransaction() {
.onlyRetryOn(CommitFailedException.class)
.run(
underlyingOps -> {
deletedFiles.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not safe to do this here because this will clear before the first commit. applyUpdates will not re-commit everything on the first commit because those are already completed. Instead I think this needs to be called within applyUpdates just before all operations are re-committed.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right, I stepped through the code a bit more, really the applyUpdates is more of a reapplyUpdates (in case the table state changed). applyUpdates is a no-op on the first attempt because the individual operations commit has already completed. If the deletedFiles set is cleared before the first commit and that commit succeeds we would never end up cleaning those files based on the cleanup done here https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L450. Although I don't know why our tests wouldn't pick up on this, we have tests which validate we cleanup the files we would expect

@rdblue
Copy link
Contributor

rdblue commented Sep 20, 2023

@amogh-jahagirdar pointed out that we already check and do not delete files that were committed. In that case, we want the set of delete files to be any file that was ever deleted and we don't need to clear the set. Thanks for digging in and finding the right answer, Amogh!

@rdblue rdblue closed this Sep 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants