-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Core: Add Variant implementation to read serialized objects #11415
Conversation
@aihuaxu I cleaned up my implementation, added tests, and fixed quite a few bugs. Please take a look to help validate that it implements the spec correctly. Thanks! |
My main overall question on this is whether or not this implementation belongs in the Iceberg project or in the Parquet project? I'm a little worried about a proliferation of implementations especially if we potentially would use two different implementations within the same code path (1 for spark and 1 for core) I'll do a real review later |
core/src/main/java/org/apache/iceberg/variants/SerializedObject.java
Outdated
Show resolved
Hide resolved
The Spark failures are a port conflict. I think it's unrelated to these changes. We'll see the next time CI runs (I'm sure we'll have more changes to trigger them) |
core/src/test/java/org/apache/iceberg/variants/TestShreddedObject.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/variants/TestShreddedObject.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
@Override | ||
public VariantValue get(String field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what we are implementing in ShreddedObject: basically we are providing the same interface get(String field) as regular VariantObject.
Given the following example, assume event.location.latitude
is shredded while event.location.longitude
is not. How do we model shredded event
object - where to place the field location
? Of course, from read side, it doesn't matter if the we place location in shredded or unshredded. We check both.
event {
event_id;
location {
latitude;
longitude;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If latitude
is shredded and longitude
is not, then location
must be a partially shredded object. That object would contain longitude
in the value
field and would have a typed_value
group that contains a latitude
group that has a value
and typed_value
pair.
And because location
is a partially shredded object, event
is also a partially shredded object that has a shredded location
group.
"b", | ||
Variants.of("iceberg"), | ||
"c", | ||
Variants.of(new BigDecimal("12.21"))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our ShreddedObject model, the field values can be a VariantObject as well which could be a ShreddedObject, right?
Can we add such coverage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What cases do you have in mind?
The tests here aren't intended to be exhaustive for the types that could be in the shredded fields. They just demonstrate the behavior between shredded and unshredded fields. For an object within a ShreddedObject
, there are two cases. If it is not shredded, then it will be handled by the wrapped SerializedObject
. And if it is shredded then it will be added to a ShreddedObject
using put
to place it in the shredded map. In both cases, we're just relying on the behavior of other classes to hold another SerializedObject
or ShreddedObject
, so I'm not sure what we would test that isn't already tested by the primitives here.
I think I have a question for PrimitiveWrapper.sizeInBytes() for binary. Otherwise, looks good to me. |
core/src/main/java/org/apache/iceberg/variants/VariantMetadata.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
static short readLittleEndianInt16(ByteBuffer buffer, int offset) { | ||
return buffer.getShort(buffer.position() + offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to validate endianness on the buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm avoiding endianness checks in these methods because it would happen many times for the same buffer. I think it's better to do the checks at a coarse level, like when new SerializedVariant
instances are created from a buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're expecting a specific endianness, then we should state that at the class level docs at a minimum. I feel that's necessary here since it's not the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
new byte[] {primitiveHeader(8), 0x04, (byte) 0xD2, 0x02, (byte) 0x96, 0x49}); | ||
|
||
assertThat(value.type()).isEqualTo(PhysicalType.DECIMAL4); | ||
assertThat(value.get()).isEqualTo(new BigDecimal("123456.7890")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the spec, DECIMAL4 should have the number with the precision between 1 and 9. 123456.7890 with precision 10 should store in DECIMAL8.
package org.apache.iceberg.variants; | ||
|
||
/** An variant array value. */ | ||
public interface VariantArray extends VariantValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we should add numElements() in the interface in order to call get(index).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm debating whether to do this or to implement more Java interfaces, like List
or Map
. That's why I haven't exposed them yet. Another option is to simply make these Iterable
for consumption.
I think deferring the choice now is a good option. We can fill in some of these when we get to testing the read path.
68a535f
to
7312f19
Compare
core/src/main/java/org/apache/iceberg/variants/SerializedObject.java
Outdated
Show resolved
Hide resolved
ac10fb3
to
cfb7304
Compare
Thanks for reviewing, @aihuaxu and @danielcweeks! |
This PR adds an implementation of the Variant encoding spec that can read and construct serialized Variant buffers. This implementation was written using only the spec to validate that the spec is reasonably complete.
The public API interfaces are in core and consist of:
Variant
: a wrapper interface ofVariantMetadata
andVariantValue
VariantMetadata
: metadata dictionary for valuesVariantValue
: a generic interface for values that provides serialization toByteBuffer
VariantPrimitive
: a primitive valueVariantObject
: a variant object value withget(String)
to retrieve values by nameVariantArray
: a variant array value withget(int)
to retrieve values by positionThe implementation uses
ByteBuffer
and avoids copying. Values are lazily loaded as they are accessed and are initialized using slices of the parent buffer. Reads do not modify the original buffer. All buffers must use little-endian.Most testing is done by constructing variant cases as byte array constants. Many of these values can be used to check other implementations and may be added to the spec. This also includes test methods to create metadata, arrays, and objects for more complex cases such as multi-byte field IDs and offsets.