AVRO-3884: Add `local-timestamp-nanos` and `timestamp-nanos` #2554

Fokko · 2023-10-16T16:24:01Z

What is the purpose of the change

Within certain industries (finance, for example), nanosecond timestamps are common practice, so I propose adding them to the Avro spec as well.

Nanosecond datetime precision is needed in various fields and applications for several reasons, particularly in scenarios where extremely precise timing is critical. Here are a few reasons why nanosecond datetime precision is essential:

Financial Transactions: In the world of high-frequency trading and financial markets, nanosecond precision is necessary to accurately record and timestamp transactions. Small time differentials can result in significant advantages or losses in trading, making precise timestamps crucial for maintaining fairness and integrity in financial markets.
Scientific Research: Many scientific experiments and measurements require extremely high precision in time recording. Fields like particle physics, astronomy, and chemistry may deal with processes that occur at the nanosecond level, where the exact timing of events is vital for accurate data analysis.
Telecommunications: In telecommunications, especially for systems operating at high frequencies, nanosecond precision is required to synchronize various network components and ensure the efficient and reliable transfer of data.
GPS and Navigation: Global Positioning System (GPS) technology relies on nanosecond precision to calculate the time it takes for signals to travel from satellites to receivers. Accurate timekeeping is crucial for determining precise locations and distances.
Aerospace and Defense: In aviation and defense applications, nanosecond precision is needed for tasks such as navigation, missile guidance, and radar systems, where small timing errors can have serious consequences.
Data Centers and Distributed Systems: Modern data centers and distributed systems often require nanosecond precision for synchronization and coordination to ensure efficient and reliable operations.
Simulation and Modeling: Computer simulations and modeling in various fields, including engineering, climate science, and fluid dynamics, may need nanosecond-level precision to accurately represent real-world processes and interactions.
Network Protocols: Some network protocols and technologies, such as Ethernet, use nanosecond precision to time-stamp packets for accurate sequencing and synchronization in communication networks.
Cryptography and Security: In cryptographic applications, precise timing can be crucial for ensuring secure authentication and encryption processes, and nanosecond precision can help protect against various timing-based attacks.
High-Performance Computing: Supercomputers and other high-performance computing systems require nanosecond precision for tasks like benchmarking, profiling, and optimizing code to improve overall performance.

In these and other applications, nanosecond datetime precision is essential for ensuring accurate data recording, synchronization, and the reliable functioning of systems and processes that rely on precise timing.

Languages

Java: Supports nanoseconds using Instant.ofEpochSecond.
.NET: The C# library translates the other timestamp logical types to the DateTime and DateTimeOffset types; however, those have a precision of 100 nanoseconds.
Python: Python does not support nanos, but pandas and numpy do:

>>> import numpy as np
>>> np.datetime64(1697631171861735496, 'ns')
numpy.datetime64('2023-10-18T12:12:51.861735496')
>>> import pandas as pd
>>> pd.Timestamp(1697631171861735496)  # nano's by default :)
Timestamp('2023-10-18 12:12:51.861735496')

Rust: Native support, and accepts an i128 using from_nanos.
JavaScript: Comes with millisecond precision out of the box, but there are third-party libraries available such as timestamp-nano. It can read from 64-bit byte arrays (both little- and big-endian).
Perl: Available through a third-party package HiRes.
PHP: Not available. There is the function hrtime to get the current time in nanos; it either returns a tuple of (seconds, nanos), or can return an int64 on 64-bit systems or a float on 32-bit systems. For languages like this, we could just return the integer value (depending on the physical type).
Ruby: Uses nanoseconds in their own Time object.

Primitive type consideration

int64: Widely available, but limited range: [1677-09-21 00:12:43.145224193, 2262-04-11 23:47:16.854775807].
int128: Wide range, but might require external libraries, such as GMP for C++, or performance trade-offs because it needs BigInteger in Java (as opposed to primitives).
decimal(n, 0): Arbitrary range, but has the same issue as int128.
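
The int64 bounds quoted above can be reproduced with plain integer arithmetic. The sketch below is only an illustration (the helper name `nanos_to_utc` is hypothetical and not part of any Avro SDK). It splits the nanosecond count into whole seconds plus a remainder, because Python's `datetime` itself only carries microseconds.

```python
from datetime import datetime, timedelta

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_PER_SECOND = 1_000_000_000


def nanos_to_utc(ns: int) -> str:
    """Render nanoseconds since the Unix epoch as a UTC wall-clock string."""
    # divmod floors, so the remainder stays in [0, 1e9) for negative inputs too.
    seconds, nanos = divmod(ns, NANOS_PER_SECOND)
    whole = datetime(1970, 1, 1) + timedelta(seconds=seconds)
    return f"{whole.isoformat(sep=' ')}.{nanos:09d}"


lowest = nanos_to_utc(INT64_MIN)
highest = nanos_to_utc(INT64_MAX)
print(lowest)
print(highest)
```

The raw int64 minimum lands one nanosecond below the lower bound quoted in the list. That quoted bound matches pandas' `Timestamp.min`, presumably because pandas reserves the smallest int64 value for NaT.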

I'm leaning towards starting with int64 (and maybe int128). We could always promote int64 → int128 (and that's also binary compatible thanks to the zigzag encoding).
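
To illustrate the compatibility remark, here is a minimal sketch of the variable-length zigzag encoding that Avro uses for `long`. The function names are illustrative, and treating values beyond 64 bits is an assumption for the sake of the argument, since Avro has no 128-bit primitive today. The point is that the byte stream never records the integer width, so any value that fits in int64 would encode identically under a wider type.

```python
def zigzag(n: int) -> int:
    """Map signed integers to unsigned ones: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def encode_varint(n: int) -> bytes:
    """Zigzag-encode n, then emit 7 bits per byte, least significant group first."""
    value = zigzag(n)
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Inverse of encode_varint for a single, complete value."""
    value = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * index)
    # Undo zigzag: even values are non-negative, odd values are negative.
    return (value >> 1) ^ -(value & 1)


nanos = 1697631171861735496          # fits in int64
wide = 2**100                        # would need the hypothetical wider type
roundtrip_ok = all(decode_varint(encode_varint(v)) == v for v in (nanos, -nanos, wide))
```

A reader that decodes into a wider integer would therefore accept existing int64 data unchanged; only values outside the int64 range would produce longer byte sequences.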

Verifying this change

(Please pick one of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Extended interop tests to verify consistent valid schema names between SDKs
Added test that validates that Java throws an AvroRuntimeException on invalid binary data
Manually verified the change by building the website and checking the new redirect

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

rdblue · 2023-10-16T16:42:35Z

doc/content/en/docs/++version++/Specification/_index.md

@@ -862,6 +862,11 @@ The `timestamp-micros` logical type represents an instant on the global timeline

 A `timestamp-micros` logical type annotates an Avro `long`, where the long stores the number of microseconds from the unix epoch, 1 January 1970 00:00:00.000000 UTC.

+### Timestamp (nanosecond precision)
+The `timestamp-nanos` logical type represents an instant on the global timeline, independent of a particular time zone or calendar, with a precision of one nanosecond. Please note that time zone information gets lost in this process. Upon reading a value back, we can only reconstruct the instant, but not the original representation. In practice, such timestamps are typically displayed to users in their local time zones, therefore they may be displayed differently depending on the execution environment.


I don't think it's accurate to say that "time zone information gets lost in this process" because the type is independent of a zone. I also would not refer to "the instant". Assuming that this logical type corresponds to TIMESTAMP(9) WITHOUT TIME ZONE, I would say that any statement should be that the displayed value must never be modified with respect to the system time zone, because it has no time zone.

I agree, and I think it can be confusing to refer to a timezone at all. I copied this both from millis and micros, do we want to deviate from that?

Actually, I do not agree with the statement that "time zone information gets lost in this process": the paragraph below explicitly states the time zone in use.

This is different from the logical type local-timestamp-nanos below, that does not have time zone information.

The timestamp logical types should have examples to clarify the semantics. No need to repeat those examples for each of -millis, -micros, and -nanos though.

Given an event at noon local time on January 1, 2000, in Helsinki where the local time was two hours east of UTC:

For timestamp-millis, the timestamp is converted to UTC 2000-01-01T10:00:00 and that is then converted to Avro long (fill in the number).

For local-timestamp-millis, the timestamp is kept in local time 2000-01-01T12:00:00 and that is then converted to Avro long (fill in the number).

In either case, the schema author may add a separate field for the time zone offset (+02:00) or a time zone identifier (Europe/Helsinki), or the recipient of the data may know these via some out-of-band agreement.
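
The two numbers left open in this example can be worked out with the standard library. The following is a sketch rather than SDK code: it models Helsinki on that date as a fixed UTC+2 offset, as the example states, and all variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_LOCAL = datetime(1970, 1, 1)  # naive: no zone attached
ONE_MILLI = timedelta(milliseconds=1)

# Noon on 2000-01-01 in a zone two hours east of UTC.
event = datetime(2000, 1, 1, 12, 0, 0, tzinfo=timezone(timedelta(hours=2)))

# timestamp-millis: the instant, counted from the UTC epoch (10:00 UTC).
timestamp_millis = (event - EPOCH_UTC) // ONE_MILLI

# local-timestamp-millis: the wall-clock fields, counted as if no zone existed.
local_timestamp_millis = (event.replace(tzinfo=None) - EPOCH_LOCAL) // ONE_MILLI

print(timestamp_millis, local_timestamp_millis)
```

The two values differ by exactly the two-hour offset (7,200,000 ms), which is the information a reader cannot recover from either long alone.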

Thanks everyone for the great input here. I think we all agree that this needs some reworking.

Actually, I do not agree with the statement that "time zone information gets lost in this process": the paragraph below explicitly states the time zone in use.

The timezone is always UTC. But the local timezone that the writer lives in, is lost. I would suggest removing this sentence since it is confusing. Any objections?

The timestamp logical types should have examples to clarify the semantics. No need to repeat those examples for each of -millis, -micros, and -nanos though.

I agree there, and I also like the examples. I've restructured the documentation to remove the duplicate sections.

rdblue · 2023-10-16T16:43:28Z

doc/content/en/docs/++version++/Specification/_index.md

@@ -872,6 +877,11 @@ The `local-timestamp-micros` logical type represents a timestamp in a local time

 A `local-timestamp-micros` logical type annotates an Avro `long`, where the long stores the number of microseconds, from 1 January 1970 00:00:00.000000.

+### Local timestamp (nanosecond precision)
+The `local-timestamp-nanos` logical type represents a timestamp in a local timezone, regardless of what specific time zone is considered local, with a precision of one nanosecond.


Is this related to the Iceberg work? I don't think that we would want to use this type for timestamptz_ns because we don't consider that a "local" timestamp.

No, this is unrelated to Iceberg. But because there are also local-timestamp-{millis,micros} types, I think people will expect the local equivalent as well.

KalleOlaviNiemitalo · 2023-10-16T18:48:55Z

The C# library translates the other timestamp logical types to the DateTime and DateTimeOffset types; however, those have a precision of 100 nanoseconds, so they wouldn't be able to represent all the timestamp-nanos and local-timestamp-nanos values exactly. I guess the library would then have to define new types for these purposes, with explicit conversion operators to and from DateTime. (The conversion to DateTime should be explicit because it can lose precision, and the conversion from DateTime should be explicit because it can overflow.)
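
Both hazards mentioned here can be shown numerically. This is a Python sketch, not C# and not the Avro library: it only models a tick as 100 ns counted from the Unix epoch (the real DateTime counts ticks from year 0001), and the helper name is hypothetical.

```python
from datetime import datetime, timedelta

NANOS_PER_TICK = 100   # DateTime/DateTimeOffset resolution quoted above
INT64_MAX = 2**63 - 1


def nanos_to_ticks(ns: int):
    """Return (ticks, nanos dropped) when narrowing to 100 ns resolution."""
    return divmod(ns, NANOS_PER_TICK)


# Narrowing nanos to ticks can lose precision.
ticks, dropped = nanos_to_ticks(1697631171861735496)

# Widening the other way can overflow: a date near the end of DateTime's range,
# expressed as nanoseconds since 1970, does not fit in an int64.
far_future = datetime(9999, 12, 31) - datetime(1970, 1, 1)
far_future_ns = (far_future // timedelta(microseconds=1)) * 1000
overflows_int64 = far_future_ns > INT64_MAX
```

Since neither direction is always safe, making both conversions explicit, as suggested above, keeps the lossy or failing step visible at the call site.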

jacobmarble · 2023-10-16T22:03:18Z

I'm here to signal support for this PR.

InfluxData would like to make our customers' InfluxDB data, stored in S3, directly accessible to customers via Apache Iceberg. However, we can't do this without (1) rewriting all customer data with microsecond timestamps or (2) updating the Apache Iceberg spec to allow for nanosecond-precision timestamps.

This change helps unblock the relevant change to Apache Iceberg. Thanks!

Co-authored-by: Jacob Marble <[email protected]>

martin-g

From the Slack discussion:

https://issues.apache.org/jira/browse/AVRO-3884 - empty description !
https://github.com/apache/avro/pull/2554 - empty description !
IMO spec changes should be discussed first on the mailing list and better explained !

Fokko · 2023-10-17T09:52:08Z

@martin-g I'm sorry, I thought it is self-explanatory. I sent out an email on the dev-list yesterday but also cross-posted on Slack today to get more eyes on it. I went ahead with the PR as this helps me personally with the discussion.

Co-authored-by: Jacob Marble <[email protected]>

zcsizmadia · 2023-10-18T13:58:48Z

IMO this requires a deeper discussion than just adding it to the spec. The current time-related types are tied to language-specific time objects. Most architectures do not support the nanosecond resolution that this type would require. That needs to be investigated: what can be done in different architectures, OSs and languages.

I understand the need for ns measurements in applications; however, those are usually done by specific hardware/software support.

Without having deeper discussions, my gut feeling is that a nano time type should be only a 64/128(?) bit long integer since the real world is a wild west when it comes to that precision. Ergo might not require a new type. Of course I am open to any discussions about this.

Fokko · 2023-10-20T13:24:36Z

@zcsizmadia Thanks for jumping in here. I've drafted a document: https://docs.google.com/document/d/10syT_J5ZoJ23wvzfYCaPTY7U89OmgS6ZmU7GeGWria8/edit

Feel free to comment on the document or reply on the devlist.

Without having deeper discussions, my gut feeling is that a nano time type should be only a 64/128(?) bit long integer since the real world is a wild west when it comes to that precision. Ergo might not require a new type. Of course I am open to any discussions about this.

I fully agree here. The INT96 didn't really land, and Parquet now also uses INT64.

martin-g · 2023-10-20T13:31:45Z

Thank you for the PR description and the Google doc, @Fokko !

description is provided

martin-g · 2023-10-20T14:20:15Z

Does anyone know what is the idea behind having both local-timestamp-** and timestamp-** if the timezone is not encoded with the integer ?
Since the application should deal with the timezone then why do we need two types ?

Fokko · 2023-10-20T20:55:24Z

@martin-g Great question. The history originates from Hive, where you have multiple timestamp types. They store the same information, but when writing the fields they behave differently. They have an excellent doc on this that explains the different types.

To make it even more confusing:

timestamp-** maps to TIMESTAMP WITH LOCAL TIME ZONE, or Java Instant
local-timestamp-** maps to TIMESTAMP WITHOUT TIME ZONE, or Java LocalDateTime
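
A rough Python analogy may make the mapping concrete. It assumes that a timezone-aware `datetime` can stand in for Java's Instant and a naive `datetime` for LocalDateTime; the sample value and names are illustrative only. The same long is decoded both ways:

```python
from datetime import datetime, timedelta, timezone

value_millis = 946728000000  # one stored long, two possible readings

# Read as timestamp-millis: a point on the global timeline (aware, UTC).
instant = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=value_millis)

# Read as local-timestamp-millis: wall-clock fields with no zone (naive).
wall_clock = datetime(1970, 1, 1) + timedelta(milliseconds=value_millis)

# An instant may be shifted for display; the wall-clock reading must not be.
shown_in_utc_plus_2 = instant.astimezone(timezone(timedelta(hours=2)))

print(instant.isoformat())
print(shown_in_utc_plus_2.isoformat())
print(wall_clock.isoformat())
```

The bytes on disk are identical in both cases; only the logical type tells the reader which of the two interpretations the writer intended.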

opwvhk · 2023-10-21T08:30:17Z

Does anyone know what is the idea behind having both local-timestamp-** and timestamp-** if the timezone is not encoded with the integer ? Since the application should deal with the timezone then why do we need two types ?

Sometimes, especially with legacy systems, the timezone is 'understood'. Usually because originally there was only one timezone (of the country the company operated in). This starts to break down of course when you go to an international setting, but rewriting is terribly expensive and often outside of any budget.

In such situations I may choose the local time(stamp) types as a warning signal.

martin-g · 2023-10-23T07:33:29Z

The problem is that in the Avro data format there is no timezone data in both types. It is left to the application to provide it, e.g. by storing it in a sibling field.

My question is: Do we need both types timestamp-xyz and local-timestamp-xyz ? It looks to me that it is up to the application logic to decide how to interpret the long/i64 value.

We are going to add one more pair of such types!

I understand that it is impossible to remove an existing type from the spec (v1)! And adding a new type should be consistent with the old ones! So there is not much that can be done at the moment but to acknowledge the situation.

The best would be the timestamp-xyz types to encode both the long and the timezone in one field. Then the Avro SDKs could provide help with deserializing it to language specific classes, e.g. OffsetDateTime in Java.

KalleOlaviNiemitalo · 2023-10-23T08:25:42Z

The best would be the timestamp-xyz types to encode both the long and the timezone in one field.

Would the field contain an offset from UTC, or a time zone identifier from the IANA tz database? For a future timestamp, a time zone identifier would allow the local time to be reconverted to UTC each time the government changes the rules of the time zone. If the type were used only for past timestamps, then an offset would suffice, I guess.

martin-g · 2023-10-23T08:37:44Z

Would the field contain an offset from UTC, or a time zone identifier from the IANA tz database?

I didn't think thoroughly about specifics! Mostly because I think it is impossible to do anything about this in v1 of the spec.

My point is that we are going to add one more unnecessary pair to the current spec :-/
But if we are going to add a new type then it must be a pair, for consistency with the previous ones.

Fokko · 2023-10-23T12:35:08Z

The problem is that in the Avro data format, there is no timezone data in both types. It is left to the application to provide it, e.g. by storing it in a sibling field.

This is correct. The Hive doc also mentioned the TIMESTAMP WITH TIME ZONE that captures this behavior.

The best would be the timestamp-xyz types to encode both the long and the timezone in one field. Then the Avro SDKs could provide help with deserializing it to language-specific classes, e.g. OffsetDateTime in Java.

I don't agree about storing the timezone. I don't think that is something that should be done by Avro because it is a huge can of worms. In general, to maintain engineers' sanity, it is best to normalize everything to UTC when storing the data.

Having a way to store the timezone as well would require us to:

Apply the timezone first before being able to do any comparison on it
Handle daylight savings? (Yes, we're in that time of year again).
Historical changes to the timestamps as @KalleOlaviNiemitalo already pointed out.
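
The first point in this list can be sketched as follows: two values that carry different offsets cannot be ordered by their wall-clock fields, only after being normalized to UTC. The two events and the helper `to_utc_millis` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Two hypothetical events recorded with their own offsets.
a = datetime(2000, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=2)))  # 12:00 at UTC+2
b = datetime(2000, 1, 1, 11, 0, tzinfo=timezone.utc)                  # 11:00 at UTC


def to_utc_millis(dt: datetime) -> int:
    """Normalize an aware datetime to milliseconds since the UTC epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


# Comparing the wall-clock fields suggests that a happened after b ...
a_later_by_wall_clock = a.replace(tzinfo=None) > b.replace(tzinfo=None)

# ... but on the global timeline a (10:00 UTC) precedes b (11:00 UTC).
a_later_in_utc = to_utc_millis(a) > to_utc_millis(b)
```

With values stored as UTC-normalized longs, the comparison is a plain integer comparison and needs no zone handling at read time.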

If we want this, this should be a separate proposal and would introduce new types because we can't alter the existing ones as you already mentioned.

My point is that we are going to add one more unnecessary pair to the current spec :-/

Fair question, it would allow people who use existing local timestamps to migrate to nano precision.

martin-g · 2023-10-23T12:51:17Z

In general, to maintain engineers' sanity, it is best to normalize everything to UTC when storing the data.

But this is done in the application code, right ?
I just fail to see what is the benefit of having timestamp-x and local-timestamp-x in the context of Avro. Both are plain integers and if any normalization and calculations should be done then it is completely in the application code.

Fokko · 2023-10-23T13:09:12Z

This is a bit ambiguous indeed: you need to supply an Instant for the timestamp and a LocalDateTime for the local-timestamp. So in your application code, you need to make sure that you use the right objects. Having only timestamp would impose the responsibility of converting the LocalDateTime to an Instant on the developer. Having this distinction makes it less likely that a developer stores an Instant in a local-timestamp without noticing anything off.

RyanSkraba · 2023-10-27T17:31:47Z

Hello! I just wanted to add some reference materials that might be useful to understanding where we are today!

The AVRO local-timestamp-x design documents are at https://issues.apache.org/jira/browse/AVRO-2328
This document (especially Appendix 2) gives some pretty compelling arguments for including the local-timestamp-x types.
It builds on another, more detailed document "Consistent timestamp types in Hadoop SQL engines" that has a wider scope across Big Data SQL engines.

(I refer to these documents more than I care to admit...)

My subjective opinion is that I'd prefer to resist adding new LogicalTypes, but this is a pretty straightforward case. I really don't see any disadvantages to extending the existing timestamp types with more precision. Any SDK that can't or won't implement them will just have the nanos int64 to fall back on without any loss of precision.
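
The fallback behaviour described here might look roughly like the sketch below. The reader hook `decode_long` and its `known_logical_types` parameter are hypothetical and do not reflect any SDK's actual API; the schema uses the `logicalType` attribute with the name proposed in this PR.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SCHEMA = {"type": "long", "logicalType": "timestamp-nanos"}


def decode_long(schema: dict, raw: int, known_logical_types: set):
    """Convert if the logical type is known to this reader, else pass the long through."""
    logical = schema.get("logicalType")
    if logical == "timestamp-nanos" and logical in known_logical_types:
        seconds, nanos = divmod(raw, 1_000_000_000)
        # datetime only carries microseconds, so the sub-second nanos stay separate.
        return EPOCH + timedelta(seconds=seconds), nanos
    return raw  # unknown logical type: the plain int64 is still exact


raw = 1697631171861735496
new_sdk = decode_long(SCHEMA, raw, {"timestamp-nanos"})
old_sdk = decode_long(SCHEMA, raw, set())
```

An SDK without nanos support hands the application the same integer the writer stored, so nothing is lost; the application just has to interpret it itself.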

kojiromike · 2023-11-08T00:48:05Z

Just a quick note: I pointed out in the google doc, but Python does have support for nanos in the stdlib.
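
One stdlib facility that works at nanosecond resolution is `time.time_ns()` (Python 3.7+), which returns the time as an integer; whether this is the support the comment refers to is an assumption. The sketch below also shows why an integer matters: a float cannot hold a present-day nanosecond count exactly.

```python
import time

# Integer nanoseconds since the epoch, straight from the standard library.
now_ns = time.time_ns()

# The sample value from the PR description, pushed through a 64-bit float.
sample_ns = 1697631171861735496
as_float = float(sample_ns)
lost_ns = sample_ns - int(as_float)

print(type(now_ns).__name__, lost_ns)
```

A float64 has a 53-bit significand, while this value needs about 61 bits, so the low-order nanoseconds are rounded away; keeping the count as an `int` avoids that.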

Fokko · 2023-11-09T08:08:42Z

@kojiromike thank you for the addition, I've updated the doc. If you don't have any further concerns and are in favor of the spec change, please approve the PR :)

RyanSkraba

LGTM, thanks for the rewording. The spec is clearer with this change.

I'm very tempted to say that some of the documentation you've written in support of this PR could/should be on the website as supporting materials -- what do you think? You've put some effort into writing it up, but it could also be a good first issue for a new contributor!

Fokko · 2023-12-07T12:26:45Z

Moving this forward, thanks everyone for the input 🙌

tjwp · 2023-12-29T00:56:28Z

doc/content/en/docs/++version++/Specification/_index.md


-A `local-timestamp-micros` logical type annotates an Avro `long`, where the long stores the number of microseconds, from 1 January 1970 00:00:00.000000.
+Example: Given an event at noon local time (12:00) on January 1, 2000, in Helsinki where the local time was two hours east of UTC (UTC+2). The timestamp is converted to Avro long 946684800000 (milliseconds) and then written.


@Fokko Is the example value 946684800000 correct? This corresponds to 2000-01-01 00:00:00 UTC.

I was expecting that the value would be 946728000000 which is 2000-01-01 12:00:00 UTC, i.e. 2000-01-01 12:00:00 +0200 converted directly to milliseconds without taking the timezone into account. But maybe I'm misinterpreting local timestamp concept.

@Fokko Ping!

@Fokko Is the example value 946684800000 correct? This corresponds to 2000-01-01 00:00:00 UTC.

Yes, you're right. It should be 946728000000.

The concept of the local-timestamp is that each value is a recording of what can be seen on a calendar and a clock hanging on the wall, for example "1969-07-20 16:17:39". It can be decomposed into year, month, day, hour, minute and seconds fields, but with no time zone information available, it does not correspond to any specific point in time. It is often used in legacy systems.

Thanks @martin-g for pinging me. I was off that week, and the notification must be somewhere deep down in my mailbox.

…2554)

* AVRO-3884: Add `local-timestamp-nanos` and `timestamp-nanos`
* Add zeros
  Co-authored-by: Jacob Marble <[email protected]>
* Update precision
  Co-authored-by: Jacob Marble <[email protected]>
* Combine the datetimes and rework the wording
* Remove then
* Update doc/content/en/docs/++version++/Specification/_index.md

---------

Co-authored-by: Jacob Marble <[email protected]>

AVRO-3884: Add local-timestamp-nanos and timestamp-nanos

725a5e3

github-actions bot added the website label Oct 16, 2023

Fokko mentioned this pull request Oct 16, 2023

Spec: add nanosecond timestamp types apache/iceberg#8683

Merged

rdblue reviewed Oct 16, 2023

jacobmarble reviewed Oct 16, 2023

jacobmarble reviewed Oct 16, 2023

Add zeros

1c66734

Co-authored-by: Jacob Marble <[email protected]>

martin-g previously requested changes Oct 17, 2023

Fokko and others added 2 commits October 17, 2023 12:31

Update precision

815fbf4

Co-authored-by: Jacob Marble <[email protected]>

Combine the datetimes and rework the wording

8b82ebb

Fokko force-pushed the fd-add-nanos branch from ac29b87 to 8b82ebb Compare October 17, 2023 16:54

RyanSkraba approved these changes Nov 9, 2023

martin-g approved these changes Nov 10, 2023

KalleOlaviNiemitalo suggested changes Nov 10, 2023

Remove then

76beda2

KalleOlaviNiemitalo approved these changes Nov 10, 2023

Update doc/content/en/docs/++version++/Specification/_index.md

3b9d95b

Fokko mentioned this pull request Dec 5, 2023

AVRO-3914: Add nanos support for the Java SDK #2608

Merged

Fokko merged commit c3c41fb into apache:main Dec 7, 2023
2 checks passed

Fokko deleted the fd-add-nanos branch December 7, 2023 12:26

tjwp reviewed Dec 29, 2023

martin-g mentioned this pull request Aug 16, 2024

AVRO-4037: [C++] Add local-timestamp-millis, local-timestamp-micros logical types #3053

Merged
