Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-3918: Add UUID with fixed[16] #2652

Merged
merged 8 commits into from
Feb 2, 2024
Merged

Conversation

clesaec
Copy link
Contributor

@clesaec clesaec commented Dec 26, 2023

What is the purpose of the change

Proposal for AVRO-3918 to allow store uuid logical type in fixed or byte type.

Verifying this change

Unit test added

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (no)

@clesaec clesaec requested a review from Fokko December 26, 2023 13:34
@github-actions github-actions bot added the Java Pull Requests for Java binding label Dec 26, 2023
@KalleOlaviNiemitalo
Copy link
Contributor

Please add to the specification, with an example of how the same UUID would be serialised as string and as fixed, so that there is no doubt about the byte order. Perhaps also notes about:

  • whether a UUID field with a sort order specified will be sorted differently if the underlying type is changed from string to fixed or vice versa
  • upper vs. lower case distinction not being preserved if UUID is not encoded as string
  • changing the underlying type not being supported in schema evolution

Please add a Java test for whether the byte order is correct, preferably using the same UUID as in the spec example.

@clesaec
Copy link
Contributor Author

clesaec commented Dec 26, 2023

  • Spec changed, sort order added.
  • upper vs. lower case has no meaning for "no string" type ? (on fixed where every byte can have value from 0x00 to 0xFF, it has undetermine behaviour)
  • Schema evolution; could may be added ? (when logical type is same but not underlying type)

Java test for whether the byte order is correct, preferably using the same UUID as in the spec example ?? Can't see uuid example in spec. TestUuidConversions is not enough ?

@KalleOlaviNiemitalo
Copy link
Contributor

TestUuidConversions is not enough?

It's not enough — it tests whether the encoding and the decoding are consistent with each other in the Java implementation, but if both have the wrong byte order against the spec, then the test might not notice.

@clesaec
Copy link
Contributor Author

clesaec commented Dec 27, 2023

I added number encoded control (big endian vs little endian)

@Fokko
Copy link
Contributor

Fokko commented Dec 28, 2023

Thanks @clesaec for working on this, this is great!

@KalleOlaviNiemitalo Thanks for chiming in here. It would be great to get this on the devlist as well: https://lists.apache.org/thread/h4lvc2lk194om6cjm76wzosm22cop2xw

Or on the Google doc: https://docs.google.com/document/d/16_oSWrEM7AFUCTe0uuraAEHxywezLfoEz5ahzwvhGUk/edit#heading=h.43xuauwfk7ow

You raised some valid points, but I think we don't sort on the actual physical type, but rather on the logical type. Meaning that how you store them, is more of an implementation detail than something that should be visible to the user. Java should just sort on a java.util.UUID.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome @clesaec 🙌

doc/content/en/docs/++version++/Specification/_index.md Outdated Show resolved Hide resolved
doc/content/en/docs/++version++/Specification/_index.md Outdated Show resolved Hide resolved
public UUID fromBytes(final ByteBuffer value, final Schema schema, final LogicalType type) {
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(value.array(), null);
try {
final long mostSigBits = decoder.readLong();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Iceberg we first read it to a buffer, but I think your version here is more efficient. Let me do some testing.

@clesaec
Copy link
Contributor Author

clesaec commented Dec 29, 2023

@Fokko: I removed inefficient bytes storage

@clesaec clesaec requested a review from Fokko January 2, 2024 08:16
@clesaec
Copy link
Contributor Author

clesaec commented Jan 2, 2024

@Fokko : Merge this PR if you find it OK; otherwise, copy branch; i won't be able to merge it this afternoon

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work @clesaec I think this is almost good except one concern of inferring the Endianness of the system.

I would like to wait for more responses on the devlist before merging this. I know it is open for a while, but is also holiday season so I would like to give it a few more days.

As stated in RFC 4122 section 4.1.2, UUIDs are in network byte order.

Also added a test for string based UUID conversion.
Comment on lines 830 to 848
A `uuid` logical type annotates an Avro `string` or `fixed` of length 16. The string has to conform with [RFC-4122](https://www.ietf.org/rfc/rfc4122.txt)

The following schemas represent a uuid:
```json
{
"type": "string",
"logicalType": "uuid"
}
```

```json
{
"type": "fixed",
"size": "16",
"logicalType": "uuid"
}
```

(UUID will be sorted differently if the underlying type is a string or a fixed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wait until PR #2672 is merged, and take that as documentation.

@opwvhk opwvhk self-assigned this Jan 24, 2024
@Fokko
Copy link
Contributor

Fokko commented Jan 29, 2024

@clesaec I know you're busy, do you have time to rebase this PR? 🙌

@clesaec
Copy link
Contributor Author

clesaec commented Jan 29, 2024

Pb is not really that I'm busy, it's that as I changed job, I have now not much time for avro.
I will try to find time this week, but no guarantee.

@Fokko
Copy link
Contributor

Fokko commented Jan 29, 2024

@clesaec Thanks for the quick reply. If you don't have time, let me know. I'm happy to cherry-pick your work and create a new PR. Otherwise this week is fine as well 👍

@martin-g
Copy link
Member

martin-g commented Jan 30, 2024

@Fokko As a Avro maintainer you can push to people's PRs!
In the top-right corner of this page there is a Code dropdown button. Open its Local tab and you will see gh pr checkout 2652 - this is the Github CLI command to fetch the branch locally. Then you can make modifications and git push them.

@opwvhk
Copy link
Contributor

opwvhk commented Jan 30, 2024

@Fokko As a Avro maintainer you can push to people's PRs!

In this case, the syntax is: git push [email protected]:clesaec/avro.git avro-3918_Uuid:avro-3918_Uuid

@Fokko
Copy link
Contributor

Fokko commented Jan 31, 2024

@martin-g @opwvhk Thanks for the pointers here! I was not aware of this 👍

@Fokko Fokko changed the title AVRO-3918: add uuid with bytes and fixed AVRO-3918: Add UUID with fixed[16] Feb 1, 2024
@Fokko Fokko merged commit b0e44bd into apache:main Feb 2, 2024
16 checks passed
@sk-
Copy link

sk- commented May 8, 2024

Hi @clesaec, could you expand on the reason you removed support for bytes with logical type uuid in c9f4dea.

I understand is more efficient to store them in a fixed bytes. However, in our company we were already using bytes to represent uuids, so it would be greatly beneficial, if we could also use logical types.

Speaking of efficiency, it wouldn't be worse that the already existing string format, which uses even more bytes.

RanbirK pushed a commit to RanbirK/avro that referenced this pull request May 13, 2024
* AVRO-3918: add uuid with bytes and fixed

* AVRO-3918: add licence

* AVRO-3918: change spec

* AVRO-3918: force big endian mode for long value

* AVRO-3918: remove inefficient uuid bytes storage

* AVRO-3918: enforce network byte order

As stated in RFC 4122 section 4.1.2, UUIDs are in network byte order.

Also added a test for string based UUID conversion.

* Use buffer instead

---------

Co-authored-by: Oscar Westra van Holthe - Kind <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Pull Requests for Java binding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants