Core: fix reading of split offsets in manifests #8834

Merged: 7 commits into apache:main on Oct 16, 2023

Conversation

bryanck (Contributor) commented Oct 14, 2023

This PR fixes a critical bug in reading split offsets from manifests. A change in #8336 added caching of the offsets collection in BaseFile to avoid reallocation. However, the Avro reader reuses the same BaseFile object when reading, so the offsets list is built only for the first entry and is then returned for every subsequent entry.

cc @aokolnychyi @RussellSpitzer @rdblue @danielcweeks This can result in corrupted metadata in cases where the invalid offsets are read in and then written back, e.g. when rewriting manifests.
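
To illustrate, here is a minimal sketch of the stale-cache pattern described above (hypothetical class and field names, not the actual BaseFile code): the reader overwrites the backing array for each entry, but the lazily built list is cached on first use and never cleared.

    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.LongStream;

    class ReusedRecord {
      private long[] splitOffsets;        // overwritten for every entry
      private List<Long> splitOffsetList; // cached once, never invalidated

      void put(long[] newOffsets) {
        this.splitOffsets = newOffsets;   // the cached list is not cleared here
      }

      List<Long> splitOffsets() {
        if (splitOffsetList == null) {    // true only for the first entry read
          splitOffsetList =
              LongStream.of(splitOffsets).boxed().collect(Collectors.toList());
        }
        return splitOffsetList;           // later entries see the first entry's offsets
      }
    }

Reading three entries through one such reused instance would report the first entry's offsets three times.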

github-actions bot added the core label Oct 14, 2023
private static final RecursiveComparisonConfiguration FILE_COMPARISON_CONFIG =
    RecursiveComparisonConfiguration.builder()
        .withIgnoredFields(
            "dataSequenceNumber", "fileOrdinal", "fileSequenceNumber", "fromProjectionPos")
        .build();
Member:

So this will compare split offsets now?

bryanck (Contributor Author):

Yes, it will compare all fields from the input data files, except for the excluded fields, which aren't expected to match.

@@ -127,6 +129,7 @@ public class TableTestBase {
        .withFileSizeInBytes(10)
        .withPartitionPath("data_bucket=3") // easy way to set partition data for now
        .withRecordCount(1)
        .withSplitOffsets(ImmutableList.of(4L, 10_000_000L, 20_000_000L))
Member:

Minor comment, but can we make the split offsets unique across files? Just want to make sure things break in an obvious way if they don't match.

bryanck (Contributor Author):

Sure, I made this change.
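
For example, each file's builder can get a disjoint list (the FILE_A values are from the diff above; the FILE_B and FILE_C values here are hypothetical), so one file's cached offsets leaking into another fails the comparison on that field immediately:

    .withSplitOffsets(ImmutableList.of(4L, 10_000_000L, 20_000_000L)) // FILE_A
    .withSplitOffsets(ImmutableList.of(1L, 11_000_000L, 21_000_000L)) // FILE_B, hypothetical values
    .withSplitOffsets(ImmutableList.of(2L, 12_000_000L, 22_000_000L)) // FILE_C, hypothetical values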

advancedxy (Contributor):

Rather than reverting the optimization in #8336, would it be better to invalidate the cached splitOffsetList? The proposed change is in the org.apache.iceberg.BaseFile#put method:

      // ...
      case 12:
        this.upperBounds = SerializableByteBufferMap.wrap((Map<Integer, ByteBuffer>) value);
        return;
      case 13:
        this.keyMetadata = ByteBuffers.toByteArray((ByteBuffer) value);
        return;
      case 14:
        this.splitOffsets = ArrayUtil.toLongArray((List<Long>) value);
        this.splitOffsetList = null; // invalidate the cache
        return;
      case 15:
        this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value);
      // ...

nastra added this to the Iceberg 1.4.1 milestone Oct 16, 2023
bryanck (Contributor Author) commented Oct 16, 2023

> Rather than reverting the optimization in #8336, would it be better to invalidate the cached splitOffsetList? The proposed change is in the org.apache.iceberg.BaseFile#put method:

The goal of this PR is to revert the code to a known working state. There are a couple of optimizations that could be made (e.g. to the equality IDs list as well), but I felt those belong in a separate PR.

"Should read the expected files",
Lists.newArrayList(FILE_A.path(), FILE_B.path(), FILE_C.path()),
files);
assertThat(files)
Contributor:

I believe this can be simplified to

    assertThat(files)
          .usingRecursiveComparison()
          .ignoringFields(
              "dataSequenceNumber", "fileOrdinal", "fileSequenceNumber", "fromProjectionPos")
          .isEqualTo(Lists.newArrayList(FILE_A, FILE_B, FILE_C));

that way you don't need the static final field for the comparison config

bryanck (Contributor Author):

My thought was that the config variable could be reused in other tests.
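
For what it's worth, AssertJ also has an overload that accepts a prebuilt configuration, so the shared constant can be passed directly (a sketch reusing the FILE_COMPARISON_CONFIG field defined above):

    // Reuses the shared ignore-fields config instead of restating it inline
    assertThat(files)
        .usingRecursiveComparison(FILE_COMPARISON_CONFIG)
        .isEqualTo(Lists.newArrayList(FILE_A, FILE_B, FILE_C));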

advancedxy (Contributor):

> Rather than reverting the optimization in #8336, would it be better to invalidate the cached splitOffsetList? The proposed change is in the org.apache.iceberg.BaseFile#put method:

> The goal of this PR is to revert the code to a known working state. There are a couple of optimizations that could be made (e.g. to the equality IDs list as well), but I felt those belong in a separate PR.

If the current PR is intended to be included in the hotfix version 1.4.1, then I think reverting the code is the safe way to go.

    }

-    return splitOffsetList;
+    return ArrayUtil.toUnmodifiableLongList(splitOffsets);
Contributor:

🤦 Good catch.
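
For context, the reverted accessor rebuilds the unmodifiable list on every call rather than caching it (a sketch of the post-fix shape; null handling inside ArrayUtil.toUnmodifiableLongList is assumed):

    @Override
    public List<Long> splitOffsets() {
      // No instance-level cache, so the result stays correct even when the
      // Avro reader reuses this object across manifest entries.
      return ArrayUtil.toUnmodifiableLongList(splitOffsets);
    }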

rdblue merged commit 46cad6d into apache:main Oct 16, 2023
45 checks passed
nastra pushed a commit to nastra/iceberg that referenced this pull request Oct 16, 2023
The list was not correctly invalidated when reusing the file.
rdblue pushed a commit that referenced this pull request Oct 17, 2023
The list was not correctly invalidated when reusing the file.

Co-authored-by: Bryan Keller <[email protected]>