-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark: Avoid extra copies of manifests while optimizing V2 tables #8928
Spark: Avoid extra copies of manifests while optimizing V2 tables #8928
Conversation
if (formatVersion == 1) { | ||
assertThat(manifests.get(0).path()).isNotEqualTo(firstNewManifest.path()); | ||
assertThat(manifests.get(1).path()).isNotEqualTo(secondNewManifest.path()); | ||
} else { | ||
assertThat(manifests.get(0).path()).isEqualTo(firstNewManifest.path()); | ||
assertThat(manifests.get(1).path()).isEqualTo(secondNewManifest.path()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor] shouldn't we check if it's format = 1 and snapshotIdInheritence enabled ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are actually two different tests. This one called basicReplacement and tests the default behavior. The one below explicitly validates enabled snapshot ID inheritance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. Left minor suggestions. Thanks @aokolnychyi !
ops.current() | ||
.propertyAsBoolean( | ||
SNAPSHOT_ID_INHERITANCE_ENABLED, SNAPSHOT_ID_INHERITANCE_ENABLED_DEFAULT); | ||
this.canInheritSnapshotId = formatVersion > 1 || snapshotIdInheritanceEnabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may update the doc(https://iceberg.apache.org/docs/latest/configuration/#compatibility-flags) a bit to be clear that v2 table doesn't honor the property compatibility.snapshot-id-inheritance.enabled
, in case people get confused. Not a blocker. We can do it in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the doc in this PR, that's a good point.
@@ -337,7 +337,7 @@ private void replaceManifests( | |||
addedManifests.forEach(rewriteManifests::addManifest); | |||
commit(rewriteManifests); | |||
|
|||
if (!snapshotIdInheritanceEnabled) { | |||
if (formatVersion == 1 && !snapshotIdInheritanceEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion: can we have format version as an extra parameter for the method as following, so that we can consolidate the logic from multiple places?
PropertyUtil.propertyAsBoolean(
table.properties(),
TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED,
TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED_DEFAULT,
formatVersion);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to generalize this in SnapshotProducer
, I'd delay a utility call like this until we have multiple use cases. It is a bit special as if format is V2, the value must be ignored.
820e6bb
to
10111fd
Compare
This PR avoids extra copies of manifests while optimizing metadata for V2 tables via Spark. Specifically, we have to rewrite all remotely produced manifests while committing again to assign explicit snapshot IDs in V1 tables unless the snapshot ID inheritance is enabled (false by default). In V2 tables, however, snapshot ID inheritance is always enabled. That's why it is safe to reuse the remotely produced manifests without an expensive rewrite on the driver as long as we have a V2 table. Before this change, we always relied on the table property to check if snapshot ID inheritance is enabled, which was added before the V2 spec.