Data: Fix Parquet and Avro defaults date/time representation #11811
Conversation
```diff
-      readPlan.add(Pair.of(pos, ValueReaders.constant(field.initialDefault())));
+      readPlan.add(
+          Pair.of(
+              pos, ValueReaders.constant(convert.apply(field.type(), field.initialDefault()))));
```
This raises a question: should we also convert the constants passed by `idToConstant`?

I think we should, but not yet. The problem is that this conversion probably can't be done twice because of assumptions baked into it (like casting strings to `UTF8String` in Spark), so it is worth a separate PR to refactor all of the conversions to happen in one place. That will also be different for each file format, so it could get messy.
Let's do that in a separate PR indeed 👍
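A minimal, hypothetical sketch of why the conversion can't safely run twice: a Spark-style converter that replaces a `String` with an engine representation (here a `byte[]` standing in for `UTF8String`) fails with a `ClassCastException` when fed its own output. The class and method names below are illustrative, not Iceberg or Spark APIs.

```java
import java.nio.charset.StandardCharsets;

public class NonIdempotentConvert {
    // Stands in for a Spark-style conversion such as String -> UTF8String.
    static Object toEngineRepr(Object value) {
        // A second application receives a byte[] and this cast fails.
        return ((String) value).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Object once = toEngineRepr("default-value"); // ok: String -> byte[]
        try {
            toEngineRepr(once); // byte[] is not a String
        } catch (ClassCastException e) {
            System.out.println("conversion is not idempotent");
        }
    }
}
```

This is why converting defaults in one place (and only once) matters: the second pass over an already-converted value breaks.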
```java
assertEquals(
    field.type(),
    GenericDataUtil.internalToGeneric(field.type(), field.initialDefault()),
    actual.getField(field.name()));
```
This was the test bug. Rather than asserting that the value was passed back unchanged, it should have delegated to the `assertEquals` helper that checks the representation of the value.
The expected value (default) also needs to be converted to generic because the generic data model tests use the generic data model as the source representation.
Force-pushed 9ad4789 to 3f90aa8.
```diff
@@ -97,7 +98,8 @@ public ValueReader<?> record(Type partner, Schema record, List<ValueReader<?>> f
   Types.StructType expected = partner.asStructType();
   List<Pair<Integer, ValueReader<?>>> readPlan =
-      ValueReaders.buildReadPlan(expected, record, fieldReaders, idToConstant);
+      ValueReaders.buildReadPlan(
+          expected, record, fieldReaders, idToConstant, GenericDataUtil::internalToGeneric);
```
This change is only needed for the generic data model and not for the internal data model because defaults are already using the correct classes for internal.
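The pattern behind passing `GenericDataUtil::internalToGeneric` into `buildReadPlan` can be sketched with a simplified stand-in (this is not the actual Iceberg `ValueReaders` API): each field's default is run through a pluggable `(type, value) -> value` converter exactly once before being planted as a constant in the read plan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

public class ReadPlanSketch {
    // Simplified field: a type name and an initial default in internal form.
    record Field(String name, String type, Object initialDefault) {}

    // Hypothetical buildReadPlan: maps position -> constant value,
    // converting the default through the supplied function once.
    static List<Map.Entry<Integer, Object>> buildReadPlan(
            List<Field> fields, BiFunction<String, Object, Object> convert) {
        List<Map.Entry<Integer, Object>> plan = new ArrayList<>();
        for (int pos = 0; pos < fields.size(); pos++) {
            Field field = fields.get(pos);
            plan.add(Map.entry(pos, convert.apply(field.type(), field.initialDefault())));
        }
        return plan;
    }

    public static void main(String[] args) {
        // Internal form of a date default is an int of days since the epoch.
        List<Map.Entry<Integer, Object>> plan = buildReadPlan(
            List.of(new Field("added_on", "date", 19000)),
            (type, value) ->
                "date".equals(type) ? java.time.LocalDate.ofEpochDay((Integer) value) : value);
        System.out.println(plan.get(0).getValue()); // a java.time.LocalDate
    }
}
```

The converter hook is what lets the generic data model get `java.time` classes while the internal model can pass values through unchanged.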
```java
public abstract class DataTest {

  protected abstract void writeAndValidate(Schema schema) throws IOException;

  protected void writeAndValidate(Schema writeSchema, Schema expectedSchema) throws IOException {
```
To deduplicate code, I moved the new default test cases to the `DataTest` and `AvroDataTest` classes and disabled them with assumptions where defaults aren't needed or supported.
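The skip-with-assumptions approach can be sketched roughly like this. The class and method names are hypothetical, and a plain boolean check stands in for JUnit's `assumeTrue(...)`, which aborts (rather than fails) a test when its assumption doesn't hold:

```java
public abstract class DataTestSketch {
    // Format-specific subclasses without default-value support return false.
    protected boolean supportsDefaultValues() {
        return true;
    }

    String testDefaultValues() {
        if (!supportsDefaultValues()) {
            // JUnit's assumeTrue(supportsDefaultValues()) would abort here.
            return "skipped: defaults not supported";
        }
        return "ran default-value test";
    }

    public static void main(String[] args) {
        System.out.println(new DataTestSketch() {}.testDefaultValues());
        System.out.println(new DataTestSketch() {
            @Override
            protected boolean supportsDefaultValues() {
                return false;
            }
        }.testDefaultValues());
    }
}
```

This keeps one copy of each test case in the shared base class while letting individual formats opt out.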
```java
}

@Override
protected void writeAndValidate(Schema writeSchema, Schema expectedSchema) throws IOException {
```
Adding default tests to the Avro read path as well.
A few nits around unboxing, but looks good 👍 Thanks for cleaning up the tests.
```java
switch (type.typeId()) {
  case DATE:
    return DateTimeUtil.dateFromDays((Integer) value);
```
Suggested change:

```diff
-    return DateTimeUtil.dateFromDays((Integer) value);
+    return DateTimeUtil.dateFromDays((int) value);
```
`value` is an `Object`, so I cast to the object rather than the primitive. You're right that we can cast directly and unbox at the same time based on the method signature... but I typically don't do that for style reasons. If you cast to the primitive, there's a risk of introducing an unnecessary failure when the value is `null` and the receiving function accepts an `Integer` that can be null. I think casting to the object version is better to avoid those problems.
My reasoning is similar, but the other way around. As you mentioned, the method accepts an `int`:
https://github.com/apache/iceberg/blob/91a1505d09cebcd1d088ac53cd42732c343883de/api/src/main/java/org/apache/iceberg/util/DateTimeUtil.java#L49-L51

Casting to an `Integer` gives me the impression that a `null` is acceptable here, while it is not. I'm not very strong on this one; let's pull in @nastra as a tiebreaker :)
I agree with Ryan's reasoning. Casting to `int` will throw an NPE because it will first cast to `Integer` and then throw the NPE when calling `Integer.intValue()`. That being said, it's safer to always cast to the object type instead of the primitive in case the value can be `null`.
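The behavior both sides describe can be checked with a small standalone example (not Iceberg code): casting a null `Object` to the primitive `int` unboxes immediately via `Integer.intValue()` and throws a `NullPointerException` at the cast, while casting to `Integer` lets the null flow through.

```java
public class CastDemo {
    // Mimics a DateTimeUtil-style call site that receives a boxed value.
    static String castToPrimitive(Object value) {
        try {
            int days = (int) value; // implicit Integer.intValue(): NPE if value is null
            return "days=" + days;
        } catch (NullPointerException e) {
            return "NPE at cast";
        }
    }

    static String castToObject(Object value) {
        Integer days = (Integer) value; // a null reference passes through untouched
        return "days=" + days;
    }

    public static void main(String[] args) {
        System.out.println(castToPrimitive(7));    // days=7
        System.out.println(castToPrimitive(null)); // NPE at cast
        System.out.println(castToObject(null));    // days=null
    }
}
```

So the choice is where the failure (or the null) surfaces: at the cast site with `(int)`, or downstream with `(Integer)`.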
```java
case DATE:
  return DateTimeUtil.dateFromDays((Integer) value);
case TIME:
  return DateTimeUtil.timeFromMicros((Long) value);
```
Suggested change:

```diff
-    return DateTimeUtil.timeFromMicros((Long) value);
+    return DateTimeUtil.timeFromMicros((long) value);
```
```java
  return DateTimeUtil.timeFromMicros((Long) value);
case TIMESTAMP:
  if (((Types.TimestampType) type).shouldAdjustToUTC()) {
    return DateTimeUtil.timestamptzFromMicros((Long) value);
```
Suggested change:

```diff
-    return DateTimeUtil.timestamptzFromMicros((Long) value);
+    return DateTimeUtil.timestamptzFromMicros((long) value);
```
```java
if (((Types.TimestampType) type).shouldAdjustToUTC()) {
  return DateTimeUtil.timestamptzFromMicros((Long) value);
} else {
  return DateTimeUtil.timestampFromMicros((Long) value);
```
Suggested change:

```diff
-    return DateTimeUtil.timestampFromMicros((Long) value);
+    return DateTimeUtil.timestampFromMicros((long) value);
```
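For reference, the internal-to-`java.time` mapping the switch above relies on can be reproduced with the JDK alone. This is a sketch of the semantics (days since epoch, microseconds since midnight or epoch), not the actual `DateTimeUtil` implementation:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

public class DateTimeSketch {
    static final long MICROS_PER_SECOND = 1_000_000L;

    // DATE: days since 1970-01-01 -> LocalDate
    static LocalDate dateFromDays(int days) {
        return LocalDate.ofEpochDay(days);
    }

    // TIME: microseconds since midnight -> LocalTime
    static LocalTime timeFromMicros(long micros) {
        return LocalTime.ofNanoOfDay(micros * 1000);
    }

    // TIMESTAMP without zone: microseconds since epoch -> LocalDateTime
    static LocalDateTime timestampFromMicros(long micros) {
        return LocalDateTime.ofEpochSecond(
            Math.floorDiv(micros, MICROS_PER_SECOND),
            (int) Math.floorMod(micros, MICROS_PER_SECOND) * 1000,
            ZoneOffset.UTC);
    }

    // TIMESTAMP adjusted to UTC: microseconds since epoch -> OffsetDateTime
    static OffsetDateTime timestamptzFromMicros(long micros) {
        return timestampFromMicros(micros).atOffset(ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        System.out.println(dateFromDays(0));           // 1970-01-01
        System.out.println(timeFromMicros(1_000_000)); // 00:00:01
    }
}
```

Producing these `java.time` classes (instead of the raw `int`/`long` internal form) is exactly what this PR fixes for generic and Spark defaults.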
```java
if (value == null) {
  return null;
}

switch (type.typeId()) {
  case DECIMAL:
    return Decimal.apply((BigDecimal) value);
  case UUID:
```
Thanks for reviewing, @Fokko!
This updates the Avro and Parquet readers for Spark and Iceberg's generic data model to produce the expected date/time classes. It also adds a test for primitive type defaults.