Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API, Core: Add default value APIs and Avro implementation #9502

Merged
merged 11 commits into from
Oct 4, 2024

Conversation

wmoustafa
Copy link
Contributor

@wmoustafa wmoustafa commented Jan 17, 2024

This PR adds default value APIs according to the default value spec, and implements it in the GenericAvroReader case. It uses a ConstantReader to fill in the default values of fields from their respective initialDefault() method, and rebases the implementation in #6004 to leverage new APIs introduced in #9366.

@wmoustafa
Copy link
Contributor Author

@Fokko Could you please take a look since you helped with #9366? FYI @rdblue.

@@ -155,6 +162,41 @@ public ValueReader<?> record(Type partner, Schema record, List<ValueReader<?>> f
return recordReader(readPlan, avroSchemas.get(partner), record.getFullName());
}

Object toGenericRecord(Type type, Object data) {
// Recursively convert data to GenericRecord if type is a StructType.
// TODO: Rewrite this as a visitor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting this into a TypeUtil.SchemaVisitor would make sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I initially intended to implement it as a visitor, it turns out none of the existing visitor patterns (SchemaVisitor or SchemaWithPartnerVisitor) is a good fit for this use case since here we are co-traversing a schema and a data object (not just a schema or two schemas). I will clean up this code instead of using a visitor.

@@ -155,6 +162,41 @@ public ValueReader<?> record(Type partner, Schema record, List<ValueReader<?>> f
return recordReader(readPlan, avroSchemas.get(partner), record.getFullName());
}

Object toGenericRecord(Type type, Object data) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Suggested change
Object toGenericRecord(Type type, Object data) {
Object toGenericRecord(Type type, Object initialDefault) {

Comment on lines 173 to 175
genericRecord.put(
field.name(), toGenericRecord(field.type(), ((GenericRecord) data).get(index)));
index++;
Copy link
Contributor

@Fokko Fokko Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we assume that the structs are in the same position. For example:

struct<a (1): int, b (2): string>

But they could be swapped:

struct<b (2): string, a (1): int>

Since these are Iceberg constructs, I would expect lookups by ID to be done here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be converting actual data (some constant) coming from the schema. The data is represented in a JSON representation denoting the default value. Not sure if there will be a notion of field ID here. Further, since this is converting data from one representation to another, the target object is created from scratch and we are just reformatting the input object so not sure if field ID comparison applies here.

Schema readerSchema =
new Schema(
required(999, "col1", Types.IntegerType.get()),
Types.NestedField.optional(1000, "col2", type, null, defaultValue, defaultValue));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we test the initialDefault and the writeDefault separately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR does not implement writeDefault. Both implementation and test for writeDefault can be a separate line of work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we track all work to support default values?

"{\"keys\": [1, 2], \"values\": [\"foo\", \"bar\"]}"
},
{
Types.StructType.of(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also want to add some rainy-day scenarios, where a required/optional field is missing, out of order, etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think this (test) code boils down to calling the new default value API in NestedField with both the field schema and its default value. I see the only thing that could go wrong here is the compatibility between the schema and the default value structure. If they are not compatible, then toGenericRecord will throw a cast exception (e.g., not being able to cast a map to a list, etc). Do we want to add tests that ensure some exception is thrown in this case (the incompatibility between data and type)?

@Fokko
Copy link
Contributor

Fokko commented Jun 17, 2024

One more thing, we're also leaking V3 spec features into the codebase. Should we guard that with a flag?

@wmoustafa
Copy link
Contributor Author

wmoustafa commented Jul 10, 2024

One more thing, we're also leaking V3 spec features into the codebase. Should we guard that with a flag?

It does not seem we can easily access format version in those reader classes. Since this feature is only accessible through a new API, do you think the guard could be on the client side in engine integrations?

@Fokko Fokko added this to the Iceberg V3 Spec milestone Jul 31, 2024
@nastra nastra self-requested a review August 1, 2024 15:12
@wmoustafa wmoustafa changed the title [Reference PR] [API + Avro] Add default value APIs and Avro implementation [API + Avro] Add default value APIs and Avro implementation Aug 21, 2024
@rdblue rdblue changed the title [API + Avro] Add default value APIs and Avro implementation API, Core: Add default value APIs and Avro implementation Sep 11, 2024
@rdblue
Copy link
Contributor

rdblue commented Sep 11, 2024

One more thing, we're also leaking V3 spec features into the codebase. Should we guard that with a flag?

For #9008, we added a check in TableMetadata.Builder to prevent v1 and v2 tables from using newer features. I think that we should update that check (Schema.checkCompatibility(Schema, int)) to include looking for defaults.

Copy link
Member

@hussein-awala hussein-awala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

rdblue and others added 3 commits October 1, 2024 16:24
* Refactor to use a NestedField builder and remove nested defaults.

* Remove unsupported test cases.

* Apply spotless

* Rename for clarity.
@rdblue rdblue merged commit 8190ce7 into apache:main Oct 4, 2024
50 checks passed
@emkornfield
Copy link
Contributor

emkornfield commented Nov 26, 2024

Just wanted to ask does this only apply to Avro or does adding it to AvroGenericReader also cover other file types ORC and Parquet? If it doesn't support ORC and Parquet what is the plan to support these?

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants