Core: fix reading of split offsets in manifests #8834
Conversation
private static final RecursiveComparisonConfiguration FILE_COMPARISON_CONFIG =
    RecursiveComparisonConfiguration.builder()
        .withIgnoredFields(
            "dataSequenceNumber", "fileOrdinal", "fileSequenceNumber", "fromProjectionPos")
So this will compare split offsets now?
Yes, it will compare all fields from the input data files, except for these excluded fields, which aren't expected to match.
@@ -127,6 +129,7 @@ public class TableTestBase {
        .withFileSizeInBytes(10)
        .withPartitionPath("data_bucket=3") // easy way to set partition data for now
        .withRecordCount(1)
        .withSplitOffsets(ImmutableList.of(4L, 10_000_000L, 20_000_000L))
Minor comment, but can we make the split offsets unique? I just want to make sure things will break in an obvious way if they don't match.
Sure, I made this change
Rather than reverting the optimization in #8336, would it be better to invalidate the cached ...
case 12:
this.upperBounds = SerializableByteBufferMap.wrap((Map<Integer, ByteBuffer>) value);
return;
case 13:
this.keyMetadata = ByteBuffers.toByteArray((ByteBuffer) value);
return;
case 14:
this.splitOffsets = ArrayUtil.toLongArray((List<Long>) value);
this.splitOffsetList = null; // invalidate the cache
return;
case 15:
this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value);
....
The goal of this PR is to revert the code to a known working state. There are a couple of optimizations that could be made (e.g. to the equality IDs as well), but I felt that belongs in a separate PR.
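The invalidation approach suggested above can be sketched in isolation. The class below is a hypothetical stand-in (not Iceberg's actual `BaseFile`); it shows why the setter must drop the cached list when a reader reuses a single object across entries:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a file object that caches a boxed view of its offsets.
class ReusableFile {
  private long[] splitOffsets;
  private List<Long> splitOffsetList; // lazily built, cached

  void setSplitOffsets(long[] offsets) {
    this.splitOffsets = offsets;
    this.splitOffsetList = null; // invalidate the cache on reuse
  }

  List<Long> splitOffsets() {
    if (splitOffsetList == null) {
      splitOffsetList =
          Collections.unmodifiableList(
              Arrays.stream(splitOffsets).boxed().collect(Collectors.toList()));
    }
    return splitOffsetList;
  }
}

public class SplitOffsetInvalidationDemo {
  public static void main(String[] args) {
    ReusableFile file = new ReusableFile(); // one instance reused per entry
    file.setSplitOffsets(new long[] {4L});
    System.out.println(file.splitOffsets()); // [4]
    file.setSplitOffsets(new long[] {10L, 20L}); // second entry overwrites fields
    System.out.println(file.splitOffsets()); // [10, 20], because the setter nulled the cache
  }
}
```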
"Should read the expected files",
Lists.newArrayList(FILE_A.path(), FILE_B.path(), FILE_C.path()),
files);
assertThat(files) |
I believe this can be simplified to
assertThat(files)
.usingRecursiveComparison()
.ignoringFields(
"dataSequenceNumber", "fileOrdinal", "fileSequenceNumber", "fromProjectionPos")
.isEqualTo(Lists.newArrayList(FILE_A, FILE_B, FILE_C));
that way you don't need the static final field for the comparison config
My thought was the config variable could be reused in other tests
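For reuse across tests, AssertJ's `usingRecursiveComparison` also accepts a `RecursiveComparisonConfiguration` argument, so the shared constant can be passed directly. A minimal sketch (the class and field names here are illustrative, not from the PR):

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;
import org.assertj.core.api.recursive.comparison.RecursiveComparisonConfiguration;

public class SharedConfigSketch {
  // Shared across tests so the ignored fields stay consistent in one place.
  private static final RecursiveComparisonConfiguration FILE_COMPARISON_CONFIG =
      RecursiveComparisonConfiguration.builder()
          .withIgnoredFields(
              "dataSequenceNumber", "fileOrdinal", "fileSequenceNumber", "fromProjectionPos")
          .build();

  void assertSameFiles(List<Object> files, List<Object> expected) {
    // Passing the config avoids repeating ignoringFields(...) in every test.
    assertThat(files)
        .usingRecursiveComparison(FILE_COMPARISON_CONFIG)
        .isEqualTo(expected);
  }
}
```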
If the current PR is intended to be included in the 1.4.1 hotfix release, then I think this is the safe way to revert the code.
}

-    return splitOffsetList;
+    return ArrayUtil.toUnmodifiableLongList(splitOffsets);
🤦 Good catch.
The list was not correctly invalidated when reusing the file.
Co-authored-by: Bryan Keller <[email protected]>
This PR fixes a critical bug in reading split offsets from manifests. A change in #8336 added caching of the offsets collection in `BaseFile` to avoid reallocation. However, the Avro reader reuses the same `BaseFile` object when reading, so only the offsets from the first entry are allocated, and those are then reused for all other entries. This can result in corrupted metadata in cases where the invalid offsets are read in and then written back, e.g. when rewriting manifests.

cc @aokolnychyi @RussellSpitzer @rdblue @danielcweeks