-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API, Core: Add default value APIs and Avro implementation #9502
Conversation
@@ -155,6 +162,41 @@ public ValueReader<?> record(Type partner, Schema record, List<ValueReader<?>> f | |||
return recordReader(readPlan, avroSchemas.get(partner), record.getFullName()); | |||
} | |||
|
|||
Object toGenericRecord(Type type, Object data) { | |||
// Recursively convert data to GenericRecord if type is a StructType. | |||
// TODO: Rewrite this as a visitor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converting this into a TypeUtil.SchemaVisitor
would make sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I initially intended to implement it as a visitor, it turns out none of the existing visitor patterns (SchemaVisitor
or SchemaWithPartnerVisitor
) is a good fit for this use case since here we are co-traversing a schema and a data object (not just a schema or two schemas). I will clean up this code instead of using a visitor.
@@ -155,6 +162,41 @@ public ValueReader<?> record(Type partner, Schema record, List<ValueReader<?>> f | |||
return recordReader(readPlan, avroSchemas.get(partner), record.getFullName()); | |||
} | |||
|
|||
Object toGenericRecord(Type type, Object data) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit
Object toGenericRecord(Type type, Object data) { | |
Object toGenericRecord(Type type, Object initialDefault) { |
genericRecord.put( | ||
field.name(), toGenericRecord(field.type(), ((GenericRecord) data).get(index))); | ||
index++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we assume that the structs are in the same position. For example:
struct<a (1): int, b (2): string>
But they could be swapped:
struct<b (2): string, a (1): int>
Since these are Iceberg constructs, I would expect lookups by ID to be done here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be converting actual data (some constant) coming from the schema. The data is represented in a JSON representation denoting the default value. Not sure if there will be a notion of field ID here. Further, since this is converting data from one representation to another, the target object is created from scratch and we are just reformatting the input object so not sure if field ID comparison applies here.
Schema readerSchema = | ||
new Schema( | ||
required(999, "col1", Types.IntegerType.get()), | ||
Types.NestedField.optional(1000, "col2", type, null, defaultValue, defaultValue)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we test the initialDefault
and the writeDefault
separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR does not implement writeDefault
. Both implementation and test for writeDefault
can be a separate line of work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we track all work to support default values?
"{\"keys\": [1, 2], \"values\": [\"foo\", \"bar\"]}" | ||
}, | ||
{ | ||
Types.StructType.of( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also want to add some rainy-day scenarios, where a required/optional field is missing, out of order, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think this (test) code boils down to calling the new default value API in NestedField
with both the field schema and its default value. I see the only thing that could go wrong here is the compatibility between the schema and the default value structure. If they are not compatible, then toGenericRecord
will throw a cast exception (e.g., not being able to cast a map to a list, etc). Do we want to add tests that ensure some exception is thrown in this case (the incompatibility between data and type)?
One more thing, we're also leaking V3 spec features into the codebase. Should we guard that with a flag? |
It does not seem we can easily access format version in those reader classes. Since this feature is only accessible through a new API, do you think the guard could be on the client side in engine integrations? |
1720f5c
to
a0792d5
Compare
For #9008, we added a check in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* Refactor to use a NestedField builder and remove nested defaults. * Remove unsupported test cases. * Apply spotless * Rename for clarity.
Just wanted to ask does this only apply to Avro or does adding it to AvroGenericReader also cover other file types ORC and Parquet? If it doesn't support ORC and Parquet what is the plan to support these? |
Co-authored-by: Walaa Eldin Moustafa <[email protected]> Co-authored-by: Ryan Blue <[email protected]>
This PR adds default value APIs according to the default value spec, and implements it in the
GenericAvroReader
case. It uses aConstantReader
to fill in the default values of fields from their respectiveinitialDefault()
method, and rebases the implementation in #6004 to leverage new APIs introduced in #9366.