AVRO-3666 [Java] Use the new schema parser #2642

opwvhk · 2023-12-19T11:55:15Z

What is the purpose of the change

This change updates schema parsing so forward references are handled in a uniform way, regardless of the parser used (JSON, IDL, ...).

This change is the missing bit for AVRO-3666: it re-enables some disabled tests, along with further changes needed for debugging. It also improves the schema-resolving visitor copied from the original IDL parser (which did not handle circular references in logical types) and the handling of aliases that point to the null namespace (which was lost when querying aliases).

Verifying this change

This change updates a lot of tests to use the new schema parser (or the internal JSON parser), but does not alter what the tests verify. As such, the change is covered by the existing schema parsing tests.

Documentation

Does this pull request introduce a new feature? (~~yes~~ / no)
If yes, how is the feature documented? (not applicable / ~~docs / JavaDocs / not documented~~)

This undoes the split schema parsing to allow forward references, which is to be handled by the SchemaParser & ParseContext classes. It uses the new ParseContext for the classic schema parser to accommodate this. Next step: use the new SchemaParser and resolve unresolved / forward references after parsing. This will also resolve "forward" references that were parsed in subsequent files.

By resolving references after parsing, we allow forward references both within a file and between subsequent files. This change also includes using the new SchemaParser everywhere, as using it is the best way to flush out bugs.
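
The idea of resolving references only after everything has been parsed can be illustrated with a small sketch that does not depend on Avro. The names below (`DeferredResolutionSketch`, `parseAll`, `unresolved`) are hypothetical and are not part of the Avro API; each "file" is modelled as a map from a type name to the type names it references.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "parse first, resolve references afterwards". This is not Avro code. */
class DeferredResolutionSketch {

  /** Phase 1: collect the definitions of every file without resolving any reference. */
  static Map<String, List<String>> parseAll(List<Map<String, List<String>>> files) {
    Map<String, List<String>> known = new LinkedHashMap<>();
    for (Map<String, List<String>> file : files) {
      known.putAll(file);
    }
    return known;
  }

  /** Phase 2: after all parsing, list every referenced name that has no definition. */
  static List<String> unresolved(Map<String, List<String>> known) {
    List<String> missing = new ArrayList<>();
    for (List<String> references : known.values()) {
      for (String reference : references) {
        if (!known.containsKey(reference) && !missing.contains(reference)) {
          missing.add(reference);
        }
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // The first file refers to "Address" before any file has defined it.
    Map<String, List<String>> file1 = Map.of("Person", List.of("Address"));
    Map<String, List<String>> file2 = Map.of("Address", List.of());
    System.out.println(unresolved(parseAll(List.of(file1, file2)))); // prints []
  }
}
```

Because the check runs once at the end, the order of the files does not matter in this sketch: a reference is only reported as missing if no file defines it.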

Also includes changes necessary to debug.

lang/java/avro/src/main/java/org/apache/avro/Schema.java

lang/java/avro/src/main/java/org/apache/avro/JsonSchemaParser.java

lang/java/avro/src/main/java/org/apache/avro/ParseContext.java

lang/java/avro/src/main/java/org/apache/avro/Schema.java

The wrong exclusion was removed.

clesaec

Some comments (not yet enough time to see all)

clesaec · 2023-12-20T08:44:45Z

lang/java/avro/src/main/java/org/apache/avro/ParseContext.java

+      NameValidator saved = Schema.getNameValidator();
+      try {
+        Schema.setNameValidator(nameValidator); // Ensure we use the same validation.
+        HashMap<String, Schema> result = new LinkedHashMap<>(oldSchemas);


Why a LinkedHashMap here? Where does it need to keep the order?
Map<String, Schema> result = new HashMap<>(oldSchemas);
Isn't that enough?

opwvhk · 2024-01-19

This is one of the things I'm not sure about: how to provide the parsed schemas. Ideally, this would be a Set<Schema> (meaning a HashMap would suffice here).
However, the current code uses a List<Schema>, and this is generally the more prevalent collection type.
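
The ordering question can be made concrete with a short sketch. It assumes, hypothetically, that the parsed schemas are exposed by copying the map's values into a list; the class `OrderedSchemasSketch` and the string stand-ins for schemas are illustrative only and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Shows why the map type matters once its values are exposed as a List. */
class OrderedSchemasSketch {

  /** Copies the values into a list; the list order mirrors the map's iteration order. */
  static List<String> asList(Map<String, String> schemasByName) {
    return new ArrayList<>(schemasByName.values());
  }

  public static void main(String[] args) {
    // A LinkedHashMap iterates in insertion order, so the list order is predictable.
    Map<String, String> linked = new LinkedHashMap<>();
    linked.put("org.example.Zebra", "record Zebra");
    linked.put("org.example.Apple", "record Apple");
    System.out.println(asList(linked)); // prints [record Zebra, record Apple]
  }
}
```

A plain `HashMap` makes no guarantee about iteration order, so the same copy could yield the schemas in a different order than they were parsed. That only matters if callers rely on the order of the returned list, which would not be the case for a `Set<Schema>`.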

lang/java/avro/src/main/java/org/apache/avro/Schema.java

This ensures the SchemaParser never returns unresolved schemata.

lang/java/compiler/src/test/java/org/apache/avro/compiler/schema/TestSchemas.java

lang/java/avro/src/main/java/org/apache/avro/JsonSchemaParser.java

lang/java/avro/src/main/java/org/apache/avro/ParseContext.java

lang/java/compiler/src/main/java/org/apache/avro/compiler/idl/SchemaResolver.java

lang/java/idl/src/main/java/org/apache/avro/idl/IdlReader.java

The JSON schema parser is quite complex (it is a large method). This change splits it into multiple methods, naming the various stages.

lang/java/avro/src/main/java/org/apache/avro/Schema.java

Fokko

I tried to review this, but don't know where to start. The schema parser is very fundamental and there are 150+ files to review. My biggest concern is about breaking public API's. We had this in the Avro 1.8 to 1.9 version because we had to remove codehaus Jackson from the public API, and it was very hard to get this downstream into other projects.

lang/java/avro/src/main/java/org/apache/avro/util/SchemaResolver.java

lang/java/avro/src/main/java/org/apache/avro/Schema.java

opwvhk · 2024-02-14T08:28:06Z

> I tried to review this, but don't know where to start. The schema parser is very fundamental and there are 150+ files to review. My biggest concern is about breaking public API's. We had this in the Avro 1.8 to 1.9 version because we had to remove codehaus Jackson from the public API, and it was very hard to get this downstream into other projects.

These are valid concerns, so I'll address them each.

Breaking APIs:

* Not intended: all existing methods should be deprecated instead.
* There's only one exception that I know of: Parser.parse(Iterable<File>) was added after the last release (AVRO-3805: Parse multiple schema #2375).

PR size:

* Largely unavoidable, given how much we use the parser.
* The starting point is separating the parser and its tests from the rest.
* The parser requires extra scrutiny; the rest falls in the category "it works, so it's good enough if it's readable".
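
As a rough illustration of the "deprecate instead of break" approach, the sketch below keeps an old entry point and lets it delegate to a new one. The class `ParserFacadeSketch` and both method names are hypothetical and do not correspond to Avro's actual classes; trimming the text merely stands in for real parsing.

```java
/** Keeps an old method available while steering callers to its replacement. */
class ParserFacadeSketch {

  /** The new entry point (hypothetical): here it only normalises the schema text. */
  static String parseText(String text) {
    return text.trim();
  }

  /**
   * The old entry point, kept so that existing callers still compile and behave the same.
   *
   * @deprecated use {@link #parseText(String)} instead.
   */
  @Deprecated
  static String parse(String text) {
    return parseText(text);
  }

  public static void main(String[] args) {
    String viaOldApi = parse("  {\"type\": \"string\"}  ");
    String viaNewApi = parseText("{\"type\": \"string\"}");
    System.out.println(viaOldApi.equals(viaNewApi)); // prints true
  }
}
```

Downstream projects keep working against the deprecated method and get a compiler warning that points them to the replacement, which is the property that matters for avoiding a repeat of the 1.8 to 1.9 migration pain.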

This change reduces the PR size, but does require some extra work after merging: the new SchemaParser class is hardly used, and the (now) obsolete Schema.Parser class is used heavily.

lang/java/compiler/src/main/java/org/apache/avro/compiler/specific/SchemaTask.java

lang/java/compiler/src/test/java/org/apache/avro/specific/TestSpecificData.java

lang/java/compiler/src/test/java/org/apache/avro/compiler/specific/TestSpecificCompiler.java

lang/java/maven-plugin/src/test/java/org/apache/avro/mojo/TestInduceMojo.java

Fokko

Looking good @opwvhk. Last time while going over it, it looked like a lot of public APIs were changed, but those are unreleased, as you already mentioned in a comment. I also ran this branch against the Iceberg test suite and found no issues there. Thanks for working on this! 🚀

Fokko · 2024-04-03T12:30:44Z

lang/java/avro/src/main/java/org/apache/avro/ParseContext.java

@@ -18,25 +18,36 @@
 package org.apache.avro;


This is a new file, so it is okay to break the APIs here

lang/java/avro/src/main/java/org/apache/avro/Schema.java

Fokko · 2024-04-03T13:15:50Z

@opwvhk Could you resolve the merge conflicts?

Co-authored-by: Fokko Driesprong <[email protected]>

Fokko · 2024-04-04T09:00:12Z

Thanks for working on this @opwvhk 🙌

g1rjeevan · 2024-04-10T12:56:47Z

Hi Team,

Has 1.12.0 or 1.11.4 been released? I see the CVE fixes related to commons-compress 1.26.0 are already in place, so it seems a release is pending. If it has not been released yet, what would be the ETA? cc: @Fokko

Fokko · 2024-04-10T13:39:11Z

A release is near, see the public mailing list: https://lists.apache.org/thread/qg70g7d9j5odkf8fxnnm342mm2kj997l. I would say this month, since the release process always takes some time in open source.

g1rjeevan · 2024-05-07T04:34:48Z

@Fokko Any update on the release ?

jbonofre · 2024-05-07T16:22:00Z

@g1rjeevan I'm preparing both the 1.11.4 and 1.12.0 releases. I need to merge one pending change. I hope to submit the release to a vote next week.

* AVRO-3666: Redo schema parsing code
  This undoes the split schema parsing to allow forward references, which is to be handled by the SchemaParser & ParseContext classes. It uses the new ParseContext for the classic schema parser to accommodate this. Next step: use the new SchemaParser and resolve unresolved / forward references after parsing. This will also resolve "forward" references that were parsed in subsequent files.
* AVRO-3666: Resolve references after parsing
  By resolving references after parsing, we allow forward references both within a file and between subsequent files. This change also includes using the new SchemaParser everywhere, as using it is the best way to flush out bugs.
* AVRO-3666: Remove wrong test
* AVRO-1535: Fix aliases as well
* AVRO-3666: Re-enable disabled test
  Also includes changes necessary to debug.
* AVRO-3666: Fix RAT exclusion
  The wrong exclusion was removed.
* AVRO-3666: Remove unused field
* AVRO-3666: Introduce SchemaParser.ParseResult
  This ensures the SchemaParser never returns unresolved schemata.
* AVRO-3666: Use SchemaParser for documentation
* AVRO-3666: Refactor after review
* AVRO-3666: Fix javadoc
* AVRO-3666: Fix merge bug
* AVRO-3666: Fix CodeQL warnings
* AVRO-3666: Increase test coverage
* AVRO-3666: Fix tests
* AVRO-3666: Refactor schema parsing for readability
  The JSON schema parser is quite complex (it is a large method). This change splits it into multiple methods, naming the various stages.
* AVRO-3666: rename method to avoid confusion
* AVRO-3666: Reduce PR size
  This change reduces the PR size, but does require some extra work after merging: the new SchemaParser class is hardly used, and the (now) obsolete Schema.Parser class is used heavily.
* AVRO-3666: Reduce PR size more
* AVRO-3666: Reduce PR size again
* AVRO-3666: Spotless
* Update lang/java/avro/src/main/java/org/apache/avro/Schema.java
  Co-authored-by: Fokko Driesprong <[email protected]>
* AVRO-3666: Spotless

---------

Co-authored-by: Fokko Driesprong <[email protected]>

opwvhk added 5 commits December 19, 2023 12:43

AVRO-3666: Resolve references after parsing

810657c

By resolving references after parsing, we allow forward references both within a file and between subsequent files. This change also includes using the new SchemaParser everywhere, as using it is the best way to flush out bugs.

AVRO-3666: Remove wrong test

9881669

AVRO-1535: Fix aliases as well

1243cec

AVRO-3666: Re-enable disabled test

30c51ab

Also includes changes necessary to debug.

opwvhk requested a review from clesaec December 19, 2023 11:55

github-actions bot added the Java, build, and website labels Dec 19, 2023

github-advanced-security bot found potential problems Dec 19, 2023

opwvhk added 2 commits December 19, 2023 13:05

AVRO-3666: Fix RAT exclusion

20fa338

The wrong exclusion was removed.

AVRO-3666: Remove unused field

294fdd9

clesaec reviewed Dec 20, 2023

opwvhk added 6 commits January 16, 2024 20:58

AVRO-3666: Introduce SchemaParser.ParseResult

f2dbd26

This ensures the SchemaParser never returns unresolved schemata.

AVRO-3666: Use SchemaParser for documentation

841be1b

AVRO-3666: Refactor after review

87968dd

AVRO-3666: Fix javadoc

da1935a

AVRO-3666: Merge branch 'main' into AVRO-3666-use-schema-parser

2707900

AVRO-3666: Fix merge bug

3b05640

github-advanced-security bot found potential problems Jan 19, 2024

opwvhk added 5 commits January 20, 2024 14:28

AVRO-3666: Fix CodeQL warnings

fe35f95

AVRO-3666: Increase test coverage

6223f33

AVRO-3666: Fix tests

2cdf9e9

AVRO-3666: Refactor schema parsing for readability

838eb0d

The JSON schema parser is quite complex (it is a large method). This change splits it into multiple methods, naming the various stages.

AVRO-3666: Merge branch 'main' into AVRO-3666-use-schema-parser

41806c6

github-advanced-security bot found potential problems Jan 23, 2024

lang/java/avro/src/main/java/org/apache/avro/Schema.java (fixed)

lang/java/avro/src/main/java/org/apache/avro/Schema.java (dismissed)

AVRO-3666: rename method to avoid confusion

081e3ed

Fokko reviewed Feb 6, 2024

lang/java/avro/src/main/java/org/apache/avro/util/SchemaResolver.java (resolved)

lang/java/avro/src/main/java/org/apache/avro/Schema.java (resolved)

AVRO-3666: Merge branch 'main' into AVRO-3666-use-schema-parser

7c36b1c

AVRO-3666: Reduce PR size

675064f

This change reduces the PR size, but does require some extra work after merging: the new SchemaParser class is hardly used, and the (now) obsolete Schema.Parser class is used heavily.

github-actions bot removed the build label Feb 27, 2024

github-advanced-security bot found potential problems Feb 27, 2024

opwvhk added 2 commits February 27, 2024 16:11

AVRO-3666: Reduce PR size more

e1b257a

AVRO-3666: Reduce PR size again

ca2ad31

opwvhk requested a review from Fokko February 27, 2024 15:32

Fokko approved these changes Apr 3, 2024

opwvhk and others added 4 commits April 4, 2024 09:24

AVRO-3666: Merge branch 'main' into AVRO-3666-use-schema-parser

1ea4881

AVRO-3666: Spotless

2dd88f6

Update lang/java/avro/src/main/java/org/apache/avro/Schema.java

b3f6b00

Co-authored-by: Fokko Driesprong <[email protected]>

AVRO-3666: Spotless

c5a7659

Fokko merged commit 876eae3 into apache:main Apr 4, 2024
8 checks passed

pvillard31 added a commit to pvillard31/nifi that referenced this pull request Oct 2, 2024

Fix Avro parsing due to change apache/avro#2642 in Avro 1.12.0

6c2c9d3
