Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data: Fix Parquet and Avro defaults date/time representation #11811

Merged
merged 2 commits into from
Dec 18, 2024

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Dec 18, 2024

This updates the Avro and Parquet readers for Spark and Iceberg's generic data model to produce the expected date/time classes. It also adds a test for primitive type defaults.

readPlan.add(Pair.of(pos, ValueReaders.constant(field.initialDefault())));
readPlan.add(
Pair.of(
pos, ValueReaders.constant(convert.apply(field.type(), field.initialDefault()))));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises a question: should we also convert the constants passed by idToConstant?

I think we should but not yet. The problem is that this conversion probably can't be done twice because of assumptions in the conversion (like casting strings to UTF8String in Spark) so it is worth a separate PR to refactor all of the conversions to be done in one place. That will also be different for each file format so it could get messy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do that in a separate PR indeed 👍

assertEquals(
field.type(),
GenericDataUtil.internalToGeneric(field.type(), field.initialDefault()),
actual.getField(field.name()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the test bug. Rather than asserting that the value was passed back unchanged, it should have delegated to assertEquals that checks the representation of the value.

The expected value (default) also needs to be converted to generic because the generic data model tests use the generic data model as the source representation.

@rdblue rdblue force-pushed the fix-generic-default-classes branch from 9ad4789 to 3f90aa8 Compare December 18, 2024 18:53
@github-actions github-actions bot added the spark label Dec 18, 2024
@rdblue rdblue changed the title Data: Fix Parquet and Avro generic date/time representation Data: Fix Parquet and Avro defaults date/time representation Dec 18, 2024
@@ -97,7 +98,8 @@ public ValueReader<?> record(Type partner, Schema record, List<ValueReader<?>> f

Types.StructType expected = partner.asStructType();
List<Pair<Integer, ValueReader<?>>> readPlan =
ValueReaders.buildReadPlan(expected, record, fieldReaders, idToConstant);
ValueReaders.buildReadPlan(
expected, record, fieldReaders, idToConstant, GenericDataUtil::internalToGeneric);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is only needed for the generic data model and not for the internal data model because defaults are already using the correct classes for internal.


public abstract class DataTest {

protected abstract void writeAndValidate(Schema schema) throws IOException;

protected void writeAndValidate(Schema writeSchema, Schema expectedSchema) throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To deduplicate code, I moved the new default test cases to the DataTest and AvroDataTest classes and disable them with assumptions where defaults aren't needed or supported.

}

@Override
protected void writeAndValidate(Schema writeSchema, Schema expectedSchema) throws IOException {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding default tests to the Avro read path as well.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few nits around unboxing, but looks good 👍 Thanks for cleaning up the tests


switch (type.typeId()) {
case DATE:
return DateTimeUtil.dateFromDays((Integer) value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return DateTimeUtil.dateFromDays((Integer) value);
return DateTimeUtil.dateFromDays((int) value);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Value is an Object, so I cast to the object rather than the primitive. Looks like you're right that we can cast directly and unbox at the same time based on the method signature... but I typically don't do that for style reasons. If you cast to the primitive, then there's a risk of introducing an unnecessary failure if the value is null and the receiving function accepts an Integer that can be null. I think casting to the object version is better to avoid those problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning is similar, but the other way around:

As you mentioned, the method accepts an int:
https://github.com/apache/iceberg/blob/91a1505d09cebcd1d088ac53cd42732c343883de/api/src/main/java/org/ap
ache/iceberg/util/DateTimeUtil.java#L49-L51

Casting to an Integer gives the impression to me that a null is acceptable here, while it is not. I'm not very strong on this one, let's pull in @nastra as a tiebreaker :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Ryan's reasoning. Casting to int will throw a NPE because it will first cast it to Integer and then throw the NPE when doing Integer.intValue(). That being said, it's safer to always cast to the object type instead of the primitive in case the value can be null

case DATE:
return DateTimeUtil.dateFromDays((Integer) value);
case TIME:
return DateTimeUtil.timeFromMicros((Long) value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return DateTimeUtil.timeFromMicros((Long) value);
return DateTimeUtil.timeFromMicros((long) value);

return DateTimeUtil.timeFromMicros((Long) value);
case TIMESTAMP:
if (((Types.TimestampType) type).shouldAdjustToUTC()) {
return DateTimeUtil.timestamptzFromMicros((Long) value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return DateTimeUtil.timestamptzFromMicros((Long) value);
return DateTimeUtil.timestamptzFromMicros((long) value);

if (((Types.TimestampType) type).shouldAdjustToUTC()) {
return DateTimeUtil.timestamptzFromMicros((Long) value);
} else {
return DateTimeUtil.timestampFromMicros((Long) value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return DateTimeUtil.timestampFromMicros((Long) value);
return DateTimeUtil.timestampFromMicros((long) value);

if (value == null) {
return null;
}

switch (type.typeId()) {
case DECIMAL:
return Decimal.apply((BigDecimal) value);
case UUID:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch:

image

readPlan.add(Pair.of(pos, ValueReaders.constant(field.initialDefault())));
readPlan.add(
Pair.of(
pos, ValueReaders.constant(convert.apply(field.type(), field.initialDefault()))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do that in a separate PR indeed 👍

@rdblue
Copy link
Contributor Author

rdblue commented Dec 18, 2024

Thanks for reviewing, @Fokko!

@rdblue rdblue merged commit d0effc6 into apache:main Dec 18, 2024
49 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants