Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Add internal Avro reader #11108

Merged
merged 3 commits into from
Oct 7, 2024
Merged

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Sep 11, 2024

This adds an Avro reader that produces Iceberg's internal object model and uses StructLike rather than Avro's IndexedRecord.

@rdblue rdblue requested a review from aokolnychyi September 11, 2024 00:01
@github-actions github-actions bot added the core label Sep 11, 2024
@rdblue rdblue force-pushed the internal-avro-reader branch from a0091f9 to 25e6ce9 Compare September 13, 2024 23:59
@rdblue
Copy link
Contributor Author

rdblue commented Sep 14, 2024

I've separated out the first commit as #11132. This still includes it but should merge cleanly if merged after #11132.

@rdblue rdblue force-pushed the internal-avro-reader branch 2 times, most recently from 81eacc4 to 6b6cd1b Compare September 18, 2024 19:13
Copy link
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to run our benchmarks like ManifestReadBenchmark to ensure there is no regression.

core/src/main/java/org/apache/iceberg/ManifestReader.java Outdated Show resolved Hide resolved
*
* @param <T> Java type returned by the reader
*/
public class InternalReader<T> implements DatumReader<T>, SupportsRowPosition {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need anything in the name to indicate it is specific to Avro? Will the name clash once we add a similar path for Parquet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could name it InternalAvroReader but in most places we just let the names conflict because we're only working with Parquet or Avro, not both. Is this something you think we should do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about a potential reader/writer factory that would create an appropriate instance based on the file type.


private final Types.StructType expectedType;
private final Map<Integer, Class<? extends StructLike>> typeMap = Maps.newHashMap();
private final Map<Integer, Object> idToConstant = ImmutableMap.of();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need this field if it is immutable? Will it change once we add default values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is going to change when we have default values and will be populated in the next PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan another PR to add V3 default-values support (just curious) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbonofre, sorry I missed this earlier.

I was wrong in my last comment. Default values come directly from the expected schema and support for them was added when I rebased this on top of @wmoustafa's recent PR (#9502). This constant map is here as a placeholder for when we add the ability to pass in constants that override fields, which is how we implement partition values. Those must be stored in data files, but are not read.

LogicalType logicalType = primitive.getLogicalType();
if (logicalType != null) {
switch (logicalType.getName()) {
case "date":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These string constants seem fragile but I don't have a better suggestion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's cleaner this way. No need to introduce constants that are not widely used and would need to be public for any reuse.

@@ -44,7 +44,8 @@ public class TestManifestReader extends TestBase {
"fileOrdinal",
"fileSequenceNumber",
"fromProjectionPos",
"manifestLocation")
"manifestLocation",
"partitionData.partitionType.fieldsById")
Copy link
Contributor

@aokolnychyi aokolnychyi Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the tests were moved to JUnit 5, the previous assertions were updated to use reflection and a recursive equality check. This lazy field is now present in the value that gets returned, although it doesn't really affect anything. This change gets the tests to work and I don't want to fix the real issue -- that the switch to JUnit5 wasn't done right and checks more than the original tests did rather than porting the original assertions.

Needless to say, I'm not a fan of how that conversion went, but it isn't something we should fix here.

@rdblue rdblue force-pushed the internal-avro-reader branch 2 times, most recently from 3f1bd3b to 474d444 Compare October 4, 2024 22:00
@rdblue rdblue force-pushed the internal-avro-reader branch from 474d444 to c3bcee3 Compare October 4, 2024 22:03
@aokolnychyi
Copy link
Contributor

LGTM.

@rdblue
Copy link
Contributor Author

rdblue commented Oct 7, 2024

Here are the benchmark results:

## This PR
Benchmark                               Mode  Cnt  Score   Error  Units
ManifestReadBenchmark.readManifestFile    ss    5  4.303 ± 0.687   s/op

## Main branch (8190ce7e6b)
Benchmark                               Mode  Cnt  Score   Error  Units
ManifestReadBenchmark.readManifestFile    ss    5  4.673 ± 0.535   s/op

The difference is within the error range, so I don't think there's a significant penalty (there should hopefully be a slight improvement, but it's hard to tell).

@rdblue rdblue merged commit f0e4fd2 into apache:main Oct 7, 2024
49 checks passed
@rdblue
Copy link
Contributor Author

rdblue commented Oct 7, 2024

Thanks for the review, @aokolnychyi!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants