Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Avoid extra copies of manifests while optimizing V2 tables #8928

Merged

Conversation

aokolnychyi
Copy link
Contributor

@aokolnychyi aokolnychyi commented Oct 27, 2023

This PR avoids extra copies of manifests while optimizing metadata for V2 tables via Spark. Specifically, we have to rewrite all remotely produced manifests while committing again to assign explicit snapshot IDs in V1 tables unless the snapshot ID inheritance is enabled (false by default). In V2 tables, however, snapshot ID inheritance is always enabled. That's why it is safe to reuse the remotely produced manifests without an expensive rewrite on the driver as long as we have a V2 table. Before this change, we always relied on the table property to check if snapshot ID inheritance is enabled, which was added before the V2 spec.

Comment on lines +447 to +453
if (formatVersion == 1) {
assertThat(manifests.get(0).path()).isNotEqualTo(firstNewManifest.path());
assertThat(manifests.get(1).path()).isNotEqualTo(secondNewManifest.path());
} else {
assertThat(manifests.get(0).path()).isEqualTo(firstNewManifest.path());
assertThat(manifests.get(1).path()).isEqualTo(secondNewManifest.path());
}
Copy link
Contributor

@singhpk234 singhpk234 Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[minor] shouldn't we check if it's format = 1 and snapshotIdInheritence enabled ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are actually two different tests. This one called basicReplacement and tests the default behavior. The one below explicitly validates enabled snapshot ID inheritance.

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall. Left minor suggestions. Thanks @aokolnychyi !

ops.current()
.propertyAsBoolean(
SNAPSHOT_ID_INHERITANCE_ENABLED, SNAPSHOT_ID_INHERITANCE_ENABLED_DEFAULT);
this.canInheritSnapshotId = formatVersion > 1 || snapshotIdInheritanceEnabled;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may update the doc(https://iceberg.apache.org/docs/latest/configuration/#compatibility-flags) a bit to be clear that v2 table doesn't honor the property compatibility.snapshot-id-inheritance.enabled, in case people get confused. Not a blocker. We can do it in a follow up PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the doc in this PR, that's a good point.

@@ -337,7 +337,7 @@ private void replaceManifests(
addedManifests.forEach(rewriteManifests::addManifest);
commit(rewriteManifests);

if (!snapshotIdInheritanceEnabled) {
if (formatVersion == 1 && !snapshotIdInheritanceEnabled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion: can we have format version as an extra parameter for the method as following, so that we can consolidate the logic from multiple places?

    PropertyUtil.propertyAsBoolean(
              table.properties(),
              TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED,
              TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED_DEFAULT,
              formatVersion);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to generalize this in SnapshotProducer, I'd delay a utility call like this until we have multiple use cases. It is a bit special as if format is V2, the value must be ignored.

@aokolnychyi aokolnychyi force-pushed the skip-rewriting-manifests-v2-tables branch from 820e6bb to 10111fd Compare October 31, 2023 05:16
@github-actions github-actions bot added the docs label Oct 31, 2023
@aokolnychyi aokolnychyi merged commit 50c5f26 into apache:main Oct 31, 2023
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants