Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-3930: Mapping org.apache.hadoop.io.ZStandardCodec to zstandard #2707

Merged
merged 1 commit into from
Feb 12, 2024

Conversation

yaooqinn
Copy link
Member

What is the purpose of the change

Mapping org.apache.hadoop.io.ZStandardCodec to zstandard in
HadoopCodecFactory

Verifying this change

This change added tests

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (JavaDocs)

@github-actions github-actions bot added the Java Pull Requests for Java binding label Jan 24, 2024
@martin-g
Copy link
Member

Not related to this PR but I wonder how CodecFactory.fromString("deflate"); works ?!
In one of the tests it returns DeflateCodec.class, while in another it returns GzipCodec.class

@KalleOlaviNiemitalo
Copy link
Contributor

@martin-g, what's the test in which CodecFactory.fromString("deflate") returns GZipCodec.class? To me, it looks like there is no GZipCodec class, and HadoopCodecFactory uses the DeflateCodec class for both "org.apache.hadoop.io.compress.DeflateCodec" and "org.apache.hadoop.io.compress.GZipCodec", whereas CodecFactory doesn't have this kind of aliasing built-in.

@martin-g
Copy link
Member

@KalleOlaviNiemitalo I meant these two tests:

I didn't fire an IDE to debug what is going on. It seems one is an alias to the other...

@KalleOlaviNiemitalo
Copy link
Contributor

The compress method in org.apache.avro.file.DeflateCodec calls getDeflater, which then passes the nowrap flag to java.util.zip.Deflater, and the value of nowrap is always true, so it will omit the GZip header. It seems wrong to me that "org.apache.hadoop.io.compress.GZipCodec" is mapped to that implementation, especially as GZipCodec in Hadoop uses ZlibCompressor.CompressionHeader.GZIP_FORMAT. OTOH, GZipCodec in Hadoop apparently supports both formats when decompressing, so the mismatch is perhaps not causing problems in practice.

@KalleOlaviNiemitalo
Copy link
Contributor

Oh and Hadoop spells "GzipCodec" with a lowercase "z", but HadoopCodecFactory in Avro registers "GZipCodec" with an uppercase "Z". Perhaps it is already not interoperable, then.

@yaooqinn
Copy link
Member Author

Thanks for the review, @martin-g, and @KalleOlaviNiemitalo.

Shall I fix the GzipCodec issue here? Or with a follow-up, as it seems orthogonal?

@martin-g
Copy link
Member

In a separate JIRA+PR. Thank you!

@yaooqinn
Copy link
Member Author

Hi, @martin-g @KalleOlaviNiemitalo. Do we have any comment I missed to address for this PR?

@martin-g
Copy link
Member

@yaooqinn I'll leave it to the Java SDK maintainers to review and merge.

@yaooqinn
Copy link
Member Author

OK. I got it. Thank you @martin-g

@nielsbasjes nielsbasjes self-requested a review February 12, 2024 11:05
@nielsbasjes nielsbasjes merged commit aa20739 into apache:main Feb 12, 2024
16 checks passed
@yaooqinn yaooqinn deleted the AVRO-3930 branch February 19, 2024 03:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Pull Requests for Java binding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants