-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVRO-3918: Add UUID with fixed[16]
#2652
Conversation
Please add to the specification, with an example of how the same UUID would be serialised as
Please add a Java test for whether the byte order is correct, preferably using the same UUID as in the spec example. |
|
It's not enough — it tests whether the encoding and the decoding are consistent with each other in the Java implementation, but if both have the wrong byte order against the spec, then the test might not notice. |
I added number encoded control (big endian vs little endian) |
Thanks @clesaec for working on this, this is great! @KalleOlaviNiemitalo Thanks for chiming in here. It would be great to get this on the devlist as well: https://lists.apache.org/thread/h4lvc2lk194om6cjm76wzosm22cop2xw Or on the Google doc: https://docs.google.com/document/d/16_oSWrEM7AFUCTe0uuraAEHxywezLfoEz5ahzwvhGUk/edit#heading=h.43xuauwfk7ow You raised some valid points, but I think we don't sort on the actual physical type, but rather on the logical type. Meaning that how you store them, is more of an implementation detail than something that should be visible to the user. Java should just sort on a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome @clesaec 🙌
lang/java/avro/src/test/java/org/apache/avro/TestLogicalType.java
Outdated
Show resolved
Hide resolved
public UUID fromBytes(final ByteBuffer value, final Schema schema, final LogicalType type) { | ||
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(value.array(), null); | ||
try { | ||
final long mostSigBits = decoder.readLong(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Iceberg we first read it to a buffer, but I think your version here is more efficient. Let me do some testing.
@Fokko: I removed inefficient bytes storage |
@Fokko : Merge this PR if you find it OK; otherwise, copy branch; i won't be able to merge it this afternoon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work @clesaec I think this is almost good except one concern of inferring the Endianness of the system.
I would like to wait for more responses on the devlist before merging this. I know it is open for a while, but is also holiday season so I would like to give it a few more days.
lang/java/avro/src/test/java/org/apache/avro/TestUuidConversions.java
Outdated
Show resolved
Hide resolved
As stated in RFC 4122 section 4.1.2, UUIDs are in network byte order. Also added a test for string based UUID conversion.
A `uuid` logical type annotates an Avro `string` or `fixed` of length 16. The string has to conform with [RFC-4122](https://www.ietf.org/rfc/rfc4122.txt) | ||
|
||
The following schemas represent a uuid: | ||
```json | ||
{ | ||
"type": "string", | ||
"logicalType": "uuid" | ||
} | ||
``` | ||
|
||
```json | ||
{ | ||
"type": "fixed", | ||
"size": "16", | ||
"logicalType": "uuid" | ||
} | ||
``` | ||
|
||
(UUID will be sorted differently if the underlying type is a string or a fixed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's wait until PR #2672 is merged, and take that as documentation.
@clesaec I know you're busy, do you have time to rebase this PR? 🙌 |
Pb is not really that I'm busy, it's that as I changed job, I have now not much time for avro. |
@clesaec Thanks for the quick reply. If you don't have time, let me know. I'm happy to cherry-pick your work and create a new PR. Otherwise this week is fine as well 👍 |
@Fokko As a Avro maintainer you can push to people's PRs! |
In this case, the syntax is: |
fixed[16]
Hi @clesaec, could you expand on the reason you removed support for bytes with logical type uuid in c9f4dea. I understand is more efficient to store them in a fixed bytes. However, in our company we were already using bytes to represent uuids, so it would be greatly beneficial, if we could also use logical types. Speaking of efficiency, it wouldn't be worse that the already existing string format, which uses even more bytes. |
* AVRO-3918: add uuid with bytes and fixed * AVRO-3918: add licence * AVRO-3918: change spec * AVRO-3918: force big endian mode for long value * AVRO-3918: remove inefficient uuid bytes storage * AVRO-3918: enforce network byte order As stated in RFC 4122 section 4.1.2, UUIDs are in network byte order. Also added a test for string based UUID conversion. * Use buffer instead --------- Co-authored-by: Oscar Westra van Holthe - Kind <[email protected]> Co-authored-by: Fokko Driesprong <[email protected]>
What is the purpose of the change
Proposal for AVRO-3918 to allow store uuid logical type in fixed or byte type.
Verifying this change
Unit test added
Documentation