Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-3666 [Java] Use the new schema parser #2642

Merged
merged 27 commits into from
Apr 4, 2024

Conversation

opwvhk
Copy link
Contributor

@opwvhk opwvhk commented Dec 19, 2023

What is the purpose of the change

This change updates schema parsing so forward references are handles in a uniform way, regardless of the parser used (JSON, IDL, ...).

This change is the missing bit for AVRO-3666, re-enabling some disabled tests (and more changes to debug). This also improves the schema resolving visitor copied from the original IDL parser (it didn't handle circular references in logical types), and aliases pointing to the null namespace (this was lost when querying aliases).

Verifying this change

This change updates a lot of tests to use the new schema parser (or the internal JSON parser), but does not alter tests. As such, the change is covered by existing schema parsing tests.

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

This undoes the split schema parsing to allow forward references, which
is to be handles by the SchemaParser & ParseContext classes. It uses the
new ParseContext for the classic schema parser to accommodate this.

Next step: use the new SchemaParser and resolve unresolved / forward
references after parsing. This will also resolve "forward" references
that were parsed in subsequent files.
By resolving references after parsing, we allow both forward references
within a file as between subsequent files.

This change also includes using the new SchemaParser everywhere, as
using it is the best way to flush out bugs.
Also includes changes necessary to debug.
@opwvhk opwvhk requested a review from clesaec December 19, 2023 11:55
@github-actions github-actions bot added Java Pull Requests for Java binding build website labels Dec 19, 2023
Copy link
Contributor

@clesaec clesaec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments (not yet enough time to see all)

NameValidator saved = Schema.getNameValidator();
try {
Schema.setNameValidator(nameValidator); // Ensure we use the same validation.
HashMap<String, Schema> result = new LinkedHashMap<>(oldSchemas);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why LinkedHashMap here; where it needs to keep order ?
Map<String, Schema> result = new HashMap<>(oldSchemas);
is not enough ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the things I'm not sure about: how to provide the parsed schemas. Ideally, a this would be a Set<Schema> (meaning a HashMap would suffice here).
However, current code uses a List<Schema>, and this is generally the more prevalent collection type.

lang/java/avro/src/main/java/org/apache/avro/Schema.java Outdated Show resolved Hide resolved
lang/java/avro/src/main/java/org/apache/avro/Schema.java Outdated Show resolved Hide resolved
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to review this, but don't know where to start. The schema parser is very fundamental and there are 150+ files to review. My biggest concern is about breaking public API's. We had this in the Avro 1.8 to 1.9 version because we had to remove codehaus Jackson from the public API, and it was very hard to get this downstream into other projects.

@opwvhk
Copy link
Contributor Author

opwvhk commented Feb 14, 2024

I tried to review this, but don't know where to start. The schema parser is very fundamental and there are 150+ files to review. My biggest concern is about breaking public API's. We had this in the Avro 1.8 to 1.9 version because we had to remove codehaus Jackson from the public API, and it was very hard to get this downstream into other projects.

These are valid concerns, so I'll address them each.

Breaking API's:

  • Are not intended: all existing methods should be deprecated instead.
  • There's only one exception that I know of: Parser.parse(Iterable<File>) was added after the last release (AVRO-3805: Parse multiple schema #2375)

PR size:

  • Largely unavoidable, given how much we use the parser
  • Starting point is separating the parser and its tests from the rest
  • The parser requires extra scrutiny, the rest is of the category "it works, so it's good enough if it's readable"

This change reduces the PR size, but does require some extra work after
merging: the new SchemaParser class is hardly used, and the (now)
obsolete Schema.Parser class is used heavily.
@github-actions github-actions bot removed the build label Feb 27, 2024
@opwvhk opwvhk requested a review from Fokko February 27, 2024 15:32
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good @opwvhk. Last time while going over it, it looked like a lot of public APIs were changed, it those are unreleased as you already mentioned in a comment. I also ran this branch against the Iceberg test-suite, and no issues there. Thanks for working on this! 🚀

@@ -18,25 +18,36 @@
package org.apache.avro;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new file, so it is okay to break the APIs here

lang/java/avro/src/main/java/org/apache/avro/Schema.java Outdated Show resolved Hide resolved
@Fokko
Copy link
Contributor

Fokko commented Apr 3, 2024

@opwvhk Could you resolve the merge conflicts?

@Fokko Fokko merged commit 876eae3 into apache:main Apr 4, 2024
8 checks passed
@Fokko
Copy link
Contributor

Fokko commented Apr 4, 2024

Thanks for working on this @opwvhk 🙌

@g1rjeevan
Copy link

Hi Team,

Did 1.12.0 or 1.11.4 got released ? I see CVEs fixes related commons-compress 1.26.0 are already in place. Seems to be release is pending. If not released, what would be the ETA ? cc: @Fokko

@Fokko
Copy link
Contributor

Fokko commented Apr 10, 2024

A release is near, see the public mailing list: https://lists.apache.org/thread/qg70g7d9j5odkf8fxnnm342mm2kj997l I would say this month since the release process always takes some time in open source.

@g1rjeevan
Copy link

@Fokko Any update on the release ?

@jbonofre
Copy link
Member

jbonofre commented May 7, 2024

@g1rjeevan I'm working on both 1.11.4 and 1.12.0 releases preparation. I need to merge one pending change. I hope to submit the release to vote next week.

RanbirK pushed a commit to RanbirK/avro that referenced this pull request May 13, 2024
* AVRO-3666: Redo schema parsing code

This undoes the split schema parsing to allow forward references, which
is to be handles by the SchemaParser & ParseContext classes. It uses the
new ParseContext for the classic schema parser to accommodate this.

Next step: use the new SchemaParser and resolve unresolved / forward
references after parsing. This will also resolve "forward" references
that were parsed in subsequent files.

* AVRO-3666: Resolve references after parsing

By resolving references after parsing, we allow both forward references
within a file as between subsequent files.

This change also includes using the new SchemaParser everywhere, as
using it is the best way to flush out bugs.

* AVRO-3666: Remove wrong test

* AVRO-1535: Fix aliases as well

* AVRO-3666: Re-enable disabled test

Also includes changes necessary to debug.

* AVRO-3666: Fix RAT exclusion

The wrong exclusion was removed.

* AVRO-3666: Remove unused field

* AVRO-3666: Introduce SchemaParser.ParseResult

This ensures the SchemaParser never returns unresolved schemata.

* AVRO-3666: Use SchemaParser for documentation

* AVRO-3666: Refactor after review

* AVRO-3666: Fix javadoc

* AVRO-3666: Fix merge bug

* AVRO-3666: Fix CodeQL warnings

* AVRO-3666: Increase test coverage

* AVRO-3666: Fix tests

* AVRO-3666: Refactor schema parsing for readability

The JSON schema parser is quite complex (it is a large method). This
change splits it in multiple methods, naming the various stages.

* AVRO-3666: rename method to avoid confusion

* AVRO-3666: Reduce PR size

This change reduces the PR size, but does require some extra work after
merging: the new SchemaParser class is hardly used, and the (now)
obsolete Schema.Parser class is used heavily.

* AVRO-3666: Reduce PR size more

* AVRO-3666: Reduce PR size again

* AVRO-3666: Spotless

* Update lang/java/avro/src/main/java/org/apache/avro/Schema.java

Co-authored-by: Fokko Driesprong <[email protected]>

* AVRO-3666: Spotless

---------

Co-authored-by: Fokko Driesprong <[email protected]>
pvillard31 added a commit to pvillard31/nifi that referenced this pull request Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Pull Requests for Java binding website
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants