AVRO-3884: Add local-timestamp-nanos and timestamp-nanos #2554

Merged
merged 6 commits into from
Dec 7, 2023

Conversation

Fokko
Contributor

@Fokko Fokko commented Oct 16, 2023

AVRO-3884

What is the purpose of the change

Within certain industries (finance, for example) nanosecond timestamps are common practice, so I propose adding them to the Avro spec as well.

Nanosecond datetime precision is needed in various fields and applications for several reasons, particularly in scenarios where extremely precise timing is critical. Here are a few reasons why nanosecond datetime precision is essential:

  1. Financial Transactions: In the world of high-frequency trading and financial markets, nanosecond precision is necessary to accurately record and timestamp transactions. Small time differentials can result in significant advantages or losses in trading, making precise timestamps crucial for maintaining fairness and integrity in financial markets.

  2. Scientific Research: Many scientific experiments and measurements require extremely high precision in time recording. Fields like particle physics, astronomy, and chemistry may deal with processes that occur at the nanosecond level, where the exact timing of events is vital for accurate data analysis.

  3. Telecommunications: In telecommunications, especially for systems operating at high frequencies, nanosecond precision is required to synchronize various network components and ensure the efficient and reliable transfer of data.

  4. GPS and Navigation: Global Positioning System (GPS) technology relies on nanosecond precision to calculate the time it takes for signals to travel from satellites to receivers. Accurate timekeeping is crucial for determining precise locations and distances.

  5. Aerospace and Defense: In aviation and defense applications, nanosecond precision is needed for tasks such as navigation, missile guidance, and radar systems, where small timing errors can have serious consequences.

  6. Data Centers and Distributed Systems: Modern data centers and distributed systems often require nanosecond precision for synchronization and coordination to ensure efficient and reliable operations.

  7. Simulation and Modeling: Computer simulations and modeling in various fields, including engineering, climate science, and fluid dynamics, may need nanosecond-level precision to accurately represent real-world processes and interactions.

  8. Network Protocols: Some network protocols and technologies, such as Ethernet, use nanosecond precision to time-stamp packets for accurate sequencing and synchronization in communication networks.

  9. Cryptography and Security: In cryptographic applications, precise timing can be crucial for ensuring secure authentication and encryption processes, and nanosecond precision can help protect against various timing-based attacks.

  10. High-Performance Computing: Supercomputers and other high-performance computing systems require nanosecond precision for tasks like benchmarking, profiling, and optimizing code to improve overall performance.

In these and other applications, nanosecond datetime precision is essential for ensuring accurate data recording, synchronization, and the reliable functioning of systems and processes that rely on precise timing.

Languages

>>> import numpy as np
>>> np.datetime64(1697631171861735496, 'ns')
numpy.datetime64('2023-10-18T12:12:51.861735496')
>>> import pandas as pd
>>> pd.Timestamp(1697631171861735496)  # nano's by default :)
Timestamp('2023-10-18 12:12:51.861735496')
  • Rust: Native support, and accepts an i128 using from_nanos.
  • JavaScript: Comes with millisecond precision out of the box, but there are third-party libraries available such as timestamp-nano. It can read from 64-bit byte arrays (both little and big-endian).
  • Perl: Available through a third-party package HiRes.
  • PHP: Not available. The function hrtime returns the current time in nanoseconds, either as a tuple of (seconds, nanos) or as a single number (an int64 on 64-bit systems, a float on 32-bit systems). For languages like this, we could just return the integer value (depending on the physical type).
  • Ruby: Uses nanoseconds in their own Time object.

Primitive type consideration

  • int64: Widely available, but limited range: [1677-09-21 00:12:43.145224193, 2262-04-11 23:47:16.854775807].
  • int128: Wide range, but might require external libraries, such as GMP for C++, or performance trade-offs because it needs BigInteger in Java (as opposed to primitives).
  • decimal(n, 0): Arbitrary range, but has the same issue as int128.
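
The int64 bounds quoted above can be reproduced with the standard library alone. This is a minimal sketch; `nanos_to_iso` is a hypothetical helper, not part of any Avro SDK. Python's `datetime` only keeps microseconds, so the nanosecond fraction is carried separately. The lower bound in the list matches `INT64_MIN + 1`, presumably because pandas reserves the minimum value for `NaT`.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # naive, read as UTC in this sketch
NANOS_PER_SECOND = 1_000_000_000


def nanos_to_iso(nanos: int) -> str:
    """Format nanoseconds since the epoch with all nine fractional digits."""
    # divmod floors toward negative infinity, so the fraction is never negative.
    seconds, fraction = divmod(nanos, NANOS_PER_SECOND)
    whole = EPOCH + timedelta(seconds=seconds)
    return f"{whole.isoformat(sep=' ')}.{fraction:09d}"


INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

print(nanos_to_iso(INT64_MAX))      # 2262-04-11 23:47:16.854775807
print(nanos_to_iso(INT64_MIN + 1))  # 1677-09-21 00:12:43.145224193
```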

I'm leaning towards starting with int64 (and maybe int128). We could always promote int64 → int128 (and that's also binary compatible thanks to the zigzag encoding).
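
To illustrate the compatibility argument, here is a rough sketch of the variable-length zigzag encoding that Avro uses for `long`. The function name `zigzag_varint` is made up for this example, and it relies on Python's arbitrary-precision integers instead of a fixed width. The point is that the bytes produced for a value depend only on the value, not on whether the writer declared it as 64 or 128 bits.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an integer as a zigzag varint (illustrative, not an SDK API)."""
    # Zigzag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
    z = (n << 1) if n >= 0 else ((-n << 1) - 1)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


for value in (0, -1, 1, -64, 64):
    print(value, zigzag_varint(value).hex())
```

A value that fits in an int64 yields the same bytes under a wider integer type; only values outside the int64 range need more than ten bytes.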


@@ -862,6 +862,11 @@ The `timestamp-micros` logical type represents an instant on the global timeline

A `timestamp-micros` logical type annotates an Avro `long`, where the long stores the number of microseconds from the unix epoch, 1 January 1970 00:00:00.000000 UTC.

### Timestamp (nanosecond precision)
The `timestamp-nanos` logical type represents an instant on the global timeline, independent of a particular time zone or calendar, with a precision of one nanosecond. Please note that time zone information gets lost in this process. Upon reading a value back, we can only reconstruct the instant, but not the original representation. In practice, such timestamps are typically displayed to users in their local time zones, therefore they may be displayed differently depending on the execution environment.
Contributor

I don't think it's accurate to say that "time zone information gets lost in this process" because the type is independent of a zone. I also would not refer to "the instant". Assuming that this logical type corresponds to TIMESTAMP(9) WITHOUT TIME ZONE, I would say that any statement should be that the displayed value must never be modified with respect to the system time zone, because it has no time zone.

Contributor Author

I agree, and I think it can be confusing to refer to a timezone at all. I copied this both from millis and micros, do we want to deviate from that?

Contributor

Actually, I do not agree with the statement that "time zone information gets lost in this process": the paragraph below explicitly states the time zone in use.

This is different from the logical type local-timestamp-nanos below, that does not have time zone information.

Contributor

The timestamp logical types should have examples to clarify the semantics. No need to repeat those examples for each of -millis, -micros, and -nanos though.

Given an event at noon local time on January 1, 2000, in Helsinki where the local time was two hours east of UTC:

  • For timestamp-millis, the timestamp is converted to UTC 2000-01-01T10:00:00 and that is then converted to Avro long (fill in the number).
  • For local-timestamp-millis, the timestamp is kept in local time 2000-01-01T12:00:00 and that is then converted to Avro long (fill in the number).

In either case, the schema author may add a separate field for the time zone offset (+02:00) or a time zone identifier (Europe/Helsinki), or the recipient of the data may know these via some out-of-band agreement.
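
The two numbers left open in the bullets above can be worked out with a short sketch. It assumes a fixed +02:00 offset for Helsinki on that date and uses only the standard library; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_LOCAL = datetime(1970, 1, 1)  # naive: no zone attached
MILLI = timedelta(milliseconds=1)

helsinki = timezone(timedelta(hours=2))
event = datetime(2000, 1, 1, 12, 0, tzinfo=helsinki)

# timestamp-millis: normalize to UTC (10:00), then count from the epoch.
timestamp_millis = (event - EPOCH_UTC) // MILLI

# local-timestamp-millis: drop the zone, keep the wall-clock reading (12:00).
local_timestamp_millis = (event.replace(tzinfo=None) - EPOCH_LOCAL) // MILLI

print(timestamp_millis)        # 946720800000
print(local_timestamp_millis)  # 946728000000
```

The two longs differ by exactly the two-hour offset (7,200,000 ms).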

Contributor Author

Thanks everyone for the great input here. I think we all agree that this needs some reworking.

> Actually, I do not agree with the statement that "time zone information gets lost in this process": the paragraph below explicitly states the time zone in use.

The timezone is always UTC. But the local timezone that the writer lives in, is lost. I would suggest removing this sentence since it is confusing. Any objections?

> The timestamp logical types should have examples to clarify the semantics. No need to repeat those examples for each of -millis, -micros, and -nanos though.

I agree there, and I also like the examples. I've restructured the documentation to remove the duplicate sections.

@@ -872,6 +877,11 @@ The `local-timestamp-micros` logical type represents a timestamp in a local time

A `local-timestamp-micros` logical type annotates an Avro `long`, where the long stores the number of microseconds, from 1 January 1970 00:00:00.000000.

### Local timestamp (nanosecond precision)
The `local-timestamp-nanos` logical type represents a timestamp in a local timezone, regardless of what specific time zone is considered local, with a precision of one nanosecond.
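
As a rough illustration of how the two new logical types would appear in a schema, the sketch below builds a record schema as a Python dict and prints it as JSON. The record name `Event` and the field names are hypothetical; both fields annotate an Avro `long`. An implementation that does not recognize a logical type is expected to fall back to the underlying `long`.

```python
import json

# Hypothetical record using both nanosecond logical types on an Avro long.
event_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {
            "name": "occurred_at",
            "type": {"type": "long", "logicalType": "timestamp-nanos"},
        },
        {
            "name": "wall_clock",
            "type": {"type": "long", "logicalType": "local-timestamp-nanos"},
        },
    ],
}

print(json.dumps(event_schema, indent=2))
```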
Contributor

Is this related to the Iceberg work? I don't think that we would want to use this type for timestamptz_ns because we don't consider that a "local" timestamp.

Contributor Author

No, this is unrelated to Iceberg, but because there is also a local-timestamp-{millis,micros}, I think people will expect the local equivalent as well.

@KalleOlaviNiemitalo
Contributor

The C# library translates the other timestamp logical types to the DateTime and DateTimeOffset types; however, those have a precision of 100 nanoseconds, so they wouldn't be able to represent all the timestamp-nanos and local-timestamp-nanos values exactly. I guess the library would then have to define new types for these purposes, with explicit conversion operators to and from DateTime. (The conversion to DateTime should be explicit because it can lose precision, and the conversion from DateTime should be explicit because it can overflow.)
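
The precision loss described here can be shown with plain integer arithmetic. This sketch models a 100-nanosecond tick as an integer count and ignores the different epoch used by `DateTime`, which does not affect the rounding; the helper names are made up for the example.

```python
NANOS_PER_TICK = 100  # one tick = 100 ns


def nanos_to_ticks(nanos: int) -> int:
    """Truncate a nanosecond count to whole 100 ns ticks."""
    return nanos // NANOS_PER_TICK


def ticks_to_nanos(ticks: int) -> int:
    """Expand ticks back to nanoseconds; the dropped digits are gone."""
    return ticks * NANOS_PER_TICK


original = 1697631171861735496  # the example value from the PR description
round_tripped = ticks_to_nanos(nanos_to_ticks(original))

print(round_tripped)             # 1697631171861735400
print(original - round_tripped)  # 96
```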

@jacobmarble
Contributor

I'm here to signal support for this PR.

InfluxData would like to make our customers' InfluxDB data, stored in S3, directly accessible to customers via Apache Iceberg. However, we can't do this without (1) rewriting all customer data with microsecond timestamps or (2) updating the Apache Iceberg spec to allow for nanosecond-precision timestamps.

This change helps unblock the relevant change to Apache Iceberg. Thanks!

Co-authored-by: Jacob Marble <[email protected]>
Member

@martin-g martin-g left a comment

From the Slack discussion:

https://issues.apache.org/jira/browse/AVRO-3884 - empty description !
https://github.com/apache/avro/pull/2554 - empty description !
IMO spec changes should be discussed first on the mailing list and better explained !

@Fokko
Contributor Author

Fokko commented Oct 17, 2023

@martin-g I'm sorry, I thought it was self-explanatory. I sent out an email on the dev-list yesterday but also cross-posted on Slack today to get more eyes on it. I went ahead with the PR as this helps me personally with the discussion.

@zcsizmadia
Contributor

IMO this requires a deeper discussion than just adding it to the spec. The current time-related types are tied to language-specific time objects. Most architectures do not support the nanosecond resolution that this type would require. That needs to be investigated: what can be done in different architectures, OSs and languages.

I understand the need for ns measurements in applications; however, those are usually done by specific hardware/software support.

Without having deeper discussions, my gut feeling is that a nano time type should be only a 64/128(?) bit long integer since the real world is a wild west when it comes to that precision. Ergo might not require a new type. Of course I am open to any discussions about this.

@Fokko
Contributor Author

Fokko commented Oct 20, 2023

@zcsizmadia Thanks for jumping in here. I've drafted a document: https://docs.google.com/document/d/10syT_J5ZoJ23wvzfYCaPTY7U89OmgS6ZmU7GeGWria8/edit

Feel free to comment on the document or reply on the devlist.

> Without having deeper discussions, my gut feeling is that a nano time type should be only a 64/128(?) bit long integer since the real world is a wild west when it comes to that precision. Ergo might not require a new type. Of course I am open to any discussions about this.

I fully agree here. The INT96 didn't really land, and Parquet now also uses INT64.

@martin-g
Member

Thank you for the PR description and the Google doc, @Fokko !

@martin-g martin-g dismissed their stale review October 20, 2023 13:46

description is provided

@martin-g
Member

Does anyone know what is the idea behind having both local-timestamp-** and timestamp-** if the timezone is not encoded with the integer ?
Since the application should deal with the timezone then why do we need two types ?

@Fokko
Contributor Author

Fokko commented Oct 20, 2023

@martin-g Great question. The history originates from Hive where you have multiple Timezone types. They store the same information, but when writing the fields they behave differently. They have an excellent doc on this that explains the different types.

To make it even more confusing:

  • timestamp-** maps to TIMESTAMP WITH LOCAL TIME ZONE, or Java Instant
  • local-timestamp-** maps to TIMESTAMP WITHOUT TIME ZONE, or Java LocalDateTime

@opwvhk
Contributor

opwvhk commented Oct 21, 2023

> Does anyone know what is the idea behind having both local-timestamp-** and timestamp-** if the timezone is not encoded with the integer ? Since the application should deal with the timezone then why do we need two types ?

Sometimes, especially with legacy systems, the timezone is 'understood'. Usually because originally there was only one timezone (of the country the company operated in). This starts to break down of course when you go to an international setting, but rewriting is terribly expensive and often outside of any budget.

In such situations I may choose the local time(stamp) types as a warning signal.

@martin-g
Member

The problem is that in the Avro data format there is no timezone data in either type. It is left to the application to provide it, e.g. by storing it in a sibling field.

My question is: Do we need both types timestamp-xyz and local-timestamp-xyz ? It looks to me that it is up to the application logic to decide how to interpret the long/i64 value.

We are going to add one more pair of such types!

I understand that it is impossible to remove an existing type from the spec (v1)! And adding a new type should be consistent with the old ones! So there is not much that can be done at the moment but to acknowledge the situation.

The best would be the timestamp-xyz types to encode both the long and the timezone in one field. Then the Avro SDKs could provide help with deserializing it to language specific classes, e.g. OffsetDateTime in Java.

@KalleOlaviNiemitalo
Contributor

> The best would be the timestamp-xyz types to encode both the long and the timezone in one field.

Would the field contain an offset from UTC, or a time zone identifier from the IANA tz database? For a future timestamp, a time zone identifier would allow the local time to be reconverted to UTC each time the government changes the rules of the time zone. If the type were used only for past timestamps, then an offset would suffice, I guess.

@martin-g
Member

> Would the field contain an offset from UTC, or a time zone identifier from the IANA tz database?

I didn't think thoroughly about specifics! Mostly because I think it is impossible to do anything about this in v1 of the spec.

My point is that we are going to add one more unnecessary pair to the current spec :-/
But if we are going to add a new type then it must be a pair, for consistency with the previous ones.

@Fokko
Contributor Author

Fokko commented Oct 23, 2023

> The problem is that in the Avro data format, there is no timezone data in both types. It is left to the application to provide it, e.g. by storing it in a sibling field.

This is correct. The Hive doc also mentioned the TIMESTAMP WITH TIME ZONE that captures this behavior.

> The best would be the timestamp-xyz types to encode both the long and the timezone in one field. Then the Avro SDKs could provide help with deserializing it to language-specific classes, e.g. OffsetDateTime in Java.

I don't agree about storing the timezone. I don't think that is something that should be done by Avro because it is a huge can of worms. In general, to maintain engineers' sanity, it is best to normalize everything to UTC when storing the data.

Having a way to store the timezone as well would require us to:

  • Apply the timezone first before being able to do any comparison on it
  • Handle daylight savings? (Yes, we're in that time of year again).
  • Historical changes to the timestamps as @KalleOlaviNiemitalo already pointed out.

If we want this, this should be a separate proposal and would introduce new types because we can't alter the existing ones as you already mentioned.

> My point is that we are going to add one more unnecessary pair to the current spec :-/

Fair question, it would allow people who use existing local timestamps to migrate to nano precision.

@martin-g
Member

> In general, to maintain engineers' sanity, it is best to normalize everything to UTC when storing the data.

But this is done in the application code, right ?
I just fail to see what is the benefit of having timestamp-x and local-timestamp-x in the context of Avro. Both are plain integers and if any normalization and calculations should be done then it is completely in the application code.

@Fokko
Contributor Author

Fokko commented Oct 23, 2023

This is a bit ambiguous indeed: you need to supply an Instant for the timestamp and a LocalDateTime for the local-timestamp. So in your application code, you need to make sure that you use the right objects. Having only timestamp would impose the responsibility of converting the LocalDateTime to an Instant on the developer. Having this distinction makes it less likely that a developer stores an Instant into a local-timestamp without noticing that anything is off.
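
A Python analogue of this distinction, as a sketch only: timezone-aware `datetime` objects play the role of `Instant`, naive ones the role of `LocalDateTime`. The two converter functions are hypothetical and not taken from any Avro SDK; each rejects the wrong kind of object instead of silently converting it. Since `datetime` only carries microseconds, the results are always multiples of 1000 ns.

```python
from datetime import datetime, timedelta, timezone

_EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
_EPOCH_NAIVE = datetime(1970, 1, 1)
_MICRO = timedelta(microseconds=1)


def to_timestamp_nanos(value: datetime) -> int:
    """Hypothetical encoder for timestamp-nanos: aware datetimes only."""
    if value.tzinfo is None or value.utcoffset() is None:
        raise TypeError("timestamp-nanos needs a timezone-aware datetime")
    return ((value - _EPOCH_UTC) // _MICRO) * 1000


def to_local_timestamp_nanos(value: datetime) -> int:
    """Hypothetical encoder for local-timestamp-nanos: naive datetimes only."""
    if value.tzinfo is not None:
        raise TypeError("local-timestamp-nanos needs a naive datetime")
    return ((value - _EPOCH_NAIVE) // _MICRO) * 1000


aware = datetime(2023, 10, 18, 12, 12, 51, tzinfo=timezone.utc)
print(to_timestamp_nanos(aware))  # 1697631171000000000
```

Passing `aware` to `to_local_timestamp_nanos` raises a `TypeError`, which is the kind of early failure that two separate types make possible.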

@RyanSkraba
Contributor

RyanSkraba commented Oct 27, 2023

Hello! I just wanted to add some reference materials that might be useful to understanding where we are today!

  1. The AVRO local-timestamp-x design documents are at https://issues.apache.org/jira/browse/AVRO-2328
  2. This document (especially Appendix 2) gives some pretty compelling arguments for including the local-timestamp-x types.
  3. It builds on another, more detailed document "Consistent timestamp types in Hadoop SQL engines" that has a wider scope across Big Data SQL engines.

(I refer to these documents more than I care to admit...)

My subjective opinion is that I'd prefer to resist adding new LogicalTypes, but this is a pretty straightforward case. I really don't see any disadvantages to extending the existing timestamp types with more precision. Any SDK that can't or won't implement them will just have the nanos int64 to fall back on without any loss of precision.

@kojiromike
Contributor

Just a quick note: I pointed out in the google doc, but Python does have support for nanos in the stdlib.
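
For reference, one standard-library entry point is `time.time_ns()` (available since Python 3.7), which returns the current time as integer nanoseconds since the Unix epoch. Whether this is the exact function meant in the comment above is an assumption; the sketch only shows that no third-party package is needed to obtain such a value.

```python
import time

# Integer nanoseconds since the Unix epoch, with no float rounding.
now_ns = time.time_ns()

# Split into whole seconds and the nanosecond remainder.
seconds, nanos = divmod(now_ns, 1_000_000_000)

print(now_ns)
print(seconds, nanos)
```

The actual resolution of the reading depends on the operating system clock, so the trailing digits are not necessarily meaningful.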

@Fokko
Contributor Author

Fokko commented Nov 9, 2023

@kojiromike thank you for the addition, I've updated the doc. If you don't have any further concerns and are in favor of the spec change, please approve the PR :)

Contributor

@RyanSkraba RyanSkraba left a comment

LGTM, thanks for the rewording. The spec is clearer with this change.

I'm very tempted to say that some of the documentation you've written in support of this PR could/should be on the website as supporting materials -- what do you think? You've put some effort into writing it up, but it could also be a good first issue for a new contributor!

@Fokko Fokko merged commit c3c41fb into apache:main Dec 7, 2023
2 checks passed
@Fokko Fokko deleted the fd-add-nanos branch December 7, 2023 12:26
@Fokko
Contributor Author

Fokko commented Dec 7, 2023

Moving this forward, thanks everyone for the input 🙌


A `local-timestamp-micros` logical type annotates an Avro `long`, where the long stores the number of microseconds, from 1 January 1970 00:00:00.000000.
Example: Given an event at noon local time (12:00) on January 1, 2000, in Helsinki where the local time was two hours east of UTC (UTC+2). The timestamp is converted to Avro long 946684800000 (milliseconds) and then written.
Contributor

@Fokko Is the example value 946684800000 correct? This corresponds to 2000-01-01 00:00:00 UTC.

I was expecting that the value would be 946728000000 which is 2000-01-01 12:00:00 UTC, i.e. 2000-01-01 12:00:00 +0200 converted directly to milliseconds without taking the timezone into account. But maybe I'm misinterpreting local timestamp concept.

Member

@Fokko Ping!

Contributor Author

> @Fokko Is the example value 946684800000 correct? This corresponds to 2000-01-01 00:00:00 UTC.

Yes, you're right. It should be 946728000000.

The concept of the local-timestamp is that each value is a recording of what can be seen on a calendar and a clock hanging on the wall, for example "1969-07-20 16:17:39". It can be decomposed into year, month, day, hour, minute and seconds fields, but with no time zone information available, it does not correspond to any specific point in time. It is often used in legacy systems.

Thanks @martin-g for pinging me. I was off that week, and the notification must be somewhere deep down in my mailbox.

RanbirK pushed a commit to RanbirK/avro that referenced this pull request May 13, 2024
…2554)

* AVRO-3884: Add `local-timestamp-nanos` and `timestamp-nanos`

* Add zeros

Co-authored-by: Jacob Marble <[email protected]>

* Update precision

Co-authored-by: Jacob Marble <[email protected]>

* Combine the datetimes and rework the wording

* Remove then

* Update doc/content/en/docs/++version++/Specification/_index.md

---------

Co-authored-by: Jacob Marble <[email protected]>