AVRO-3884: Add `local-timestamp-nanos` and `timestamp-nanos` #2554
Conversation
@@ -862,6 +862,11 @@ The `timestamp-micros` logical type represents an instant on the global timeline

A `timestamp-micros` logical type annotates an Avro `long`, where the long stores the number of microseconds from the unix epoch, 1 January 1970 00:00:00.000000 UTC.

### Timestamp (nanosecond precision)

The `timestamp-nanos` logical type represents an instant on the global timeline, independent of a particular time zone or calendar, with a precision of one nanosecond. Please note that time zone information gets lost in this process. Upon reading a value back, we can only reconstruct the instant, but not the original representation. In practice, such timestamps are typically displayed to users in their local time zones, therefore they may be displayed differently depending on the execution environment.
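The stored value can be sketched like this (Python purely for illustration; `to_timestamp_nanos` is a hypothetical helper, not part of any Avro library):

```python
from datetime import datetime, timezone

def to_timestamp_nanos(dt: datetime, extra_nanos: int = 0) -> int:
    """Return what the annotated Avro long would store for timestamp-nanos:
    nanoseconds since the unix epoch, after normalizing the instant to UTC.
    extra_nanos carries sub-microsecond digits datetime cannot hold."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    delta = dt.astimezone(timezone.utc) - epoch
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * 1_000 + extra_nanos
```

For example, 2000-01-01T12:00:00Z would map to 946728000000000000.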
I don't think it's accurate to say that "time zone information gets lost in this process" because the type is independent of a zone. I also would not refer to "the instant". Assuming that this logical type corresponds to TIMESTAMP(9) WITHOUT TIME ZONE, I would say that any statement should be that the displayed value must never be modified with respect to the system time zone, because it has no time zone.
I agree, and I think it can be confusing to refer to a timezone at all. I copied this from both millis and micros; do we want to deviate from that?
Actually, I do not agree with the statement that "time zone information gets lost in this process": the paragraph below explicitly states the time zone in use. This is different from the `local-timestamp-nanos` logical type below, which does not have time zone information.
The timestamp logical types should have examples to clarify the semantics. No need to repeat those examples for each of -millis, -micros, and -nanos though.
Given an event at noon local time on January 1, 2000, in Helsinki where the local time was two hours east of UTC:
- For timestamp-millis, the timestamp is converted to UTC 2000-01-01T10:00:00 and that is then converted to Avro long (fill in the number).
- For local-timestamp-millis, the timestamp is kept in local time 2000-01-01T12:00:00 and that is then converted to Avro long (fill in the number).
In either case, the schema author may add a separate field for the time zone offset (+02:00) or a time zone identifier (Europe/Helsinki), or the recipient of the data may know these via some out-of-band agreement.
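The two conversions described above can be sketched in code (Python used purely for illustration; the long values are computed from the example, not quoted from the spec):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
helsinki = timezone(timedelta(hours=2))          # UTC+2 on 2000-01-01
event = datetime(2000, 1, 1, 12, 0, tzinfo=helsinki)

# timestamp-millis: normalize to UTC (10:00), then count millis from the epoch
timestamp_millis = int((event - EPOCH).total_seconds() * 1000)

# local-timestamp-millis: keep the wall-clock reading (12:00), drop the zone
wall_clock = event.replace(tzinfo=timezone.utc)
local_timestamp_millis = int((wall_clock - EPOCH).total_seconds() * 1000)

print(timestamp_millis)        # 946720800000
print(local_timestamp_millis)  # 946728000000
```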
Thanks everyone for the great input here. I think we all agree that this needs some reworking.
Actually, I do not agree with the statement that "time zone information gets lost in this process": the paragraph below explicitly states the time zone in use.
The timezone is always UTC. But the local timezone that the writer lives in is lost. I would suggest removing this sentence since it is confusing. Any objections?
The timestamp logical types should have examples to clarify the semantics. No need to repeat those examples for each of -millis, -micros, and -nanos though.
I agree there, and I also like the examples. I've restructured the documentation to remove the duplicate sections.
@@ -872,6 +877,11 @@ The `local-timestamp-micros` logical type represents a timestamp in a local time

A `local-timestamp-micros` logical type annotates an Avro `long`, where the long stores the number of microseconds, from 1 January 1970 00:00:00.000000.

### Local timestamp (nanosecond precision)

The `local-timestamp-nanos` logical type represents a timestamp in a local timezone, regardless of what specific time zone is considered local, with a precision of one nanosecond.
Is this related to the Iceberg work? I don't think that we would want to use this type for `timestamptz_ns`, because we don't consider that a "local" timestamp.
No, this is unrelated to Iceberg, but because we already have `local-timestamp-{millis,micros}`, I think people will also expect the local equivalent.
The C# library translates the other timestamp logical types to the DateTime and DateTimeOffset types; however, those have a precision of 100 nanoseconds, so they wouldn't be able to represent all the possible values.
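The precision loss mentioned here can be illustrated with a small sketch (100 ns is the .NET tick size; the snippet is Python for illustration only):

```python
TICK_NS = 100  # resolution of .NET DateTime/DateTimeOffset ticks

def to_ticks(nanos_since_epoch: int) -> int:
    # Truncates: the last two decimal digits of the nanosecond count are lost
    return nanos_since_epoch // TICK_NS

ns = 1_234_567_891                       # some timestamp-nanos value
round_tripped = to_ticks(ns) * TICK_NS   # what survives a DateTime round trip
print(ns - round_tripped)                # 91 ns cannot be represented
```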
I'm here to signal support for this PR. InfluxData would like to make our customers' InfluxDB data, stored in S3, directly accessible to customers via Apache Iceberg. However, we can't do this without (1) rewriting all customer data with microsecond timestamps or (2) updating the Apache Iceberg spec to allow for nanosecond-precision timestamps. This change helps unblock the relevant change to Apache Iceberg. Thanks!
From the Slack discussion:
https://issues.apache.org/jira/browse/AVRO-3884 - empty description !
https://github.com/apache/avro/pull/2554 - empty description !
IMO spec changes should be discussed first on the mailing list and better explained !
@martin-g I'm sorry, I thought it was self-explanatory. I sent out an email on the dev list yesterday, but also cross-posted on Slack today to get more eyes on it. I went ahead with the PR as this helps me personally with the discussion.
IMO this requires a deeper discussion than just adding it to the spec. The current time-related types are tied to language-specific time objects. Most architectures do not support the nanosecond resolution that this type would require. That needs to be investigated: what can be done in different architectures, OSs, and languages. I understand the need for ns measurements in applications; however, those are usually done with specific hardware/software support. Without having deeper discussions, my gut feeling is that a nano time type should only be a 64/128(?) bit long integer, since the real world is a wild west when it comes to that precision. Ergo it might not require a new type. Of course I am open to any discussions about this.
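To make the 64-bit half of that concrete: an int64 count of nanoseconds spans roughly 1677–2262, as a quick stdlib check shows (Python used purely for illustration):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def ns_to_datetime(ns: int) -> datetime:
    # datetime only resolves microseconds, which is enough to see the endpoints
    return EPOCH + timedelta(microseconds=ns // 1000)

lo = ns_to_datetime(-(2**63))   # earliest representable instant
hi = ns_to_datetime(2**63 - 1)  # latest representable instant
print(lo.date(), hi.date())     # 1677-09-21 2262-04-11
```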
@zcsizmadia Thanks for jumping in here. I've drafted a document: https://docs.google.com/document/d/10syT_J5ZoJ23wvzfYCaPTY7U89OmgS6ZmU7GeGWria8/edit Feel free to comment on the document or reply on the devlist.
I fully agree here. The INT96 didn't really land, and Parquet now also uses INT64.
Thank you for the PR description and the Google doc, @Fokko !
Does anyone know what is the idea behind having both timestamp and local-timestamp types?
@martin-g Great question. The history originates from Hive where you have multiple Timezone types. They store the same information, but when writing the fields they behave differently. They have an excellent doc on this that explains the different types. To make it even more confusing:
Sometimes, especially with legacy systems, the timezone is 'understood'. Usually because originally there was only one timezone (of the country the company operated in). This starts to break down of course when you go to an international setting, but rewriting is terribly expensive and often outside of any budget. In such situations I may choose the local time(stamp) types as a warning signal.
The problem is that in the Avro data format there is no timezone data in either type. It is left to the application to provide it, e.g. by storing it in a sibling field. My question is: do we need both types? We are going to add one more pair of such types! I understand that it is impossible to remove an existing type from the spec (v1), and adding a new type should be consistent with the old ones, so there is not much that can be done at the moment but to acknowledge the situation. The best would be the
Would the field contain an offset from UTC, or a time zone identifier from the IANA tz database? For a future timestamp, a time zone identifier would allow the local time to be reconverted to UTC each time the government changes the rules of the time zone. If the type were used only for past timestamps, then an offset would suffice, I guess.
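A sibling-field approach could look like this hypothetical schema sketch (the record name and field names are illustrative only, not proposed by this PR):

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "event_time",
     "type": {"type": "long", "logicalType": "timestamp-micros"}},
    {"name": "event_zone",
     "type": "string",
     "doc": "IANA identifier such as Europe/Helsinki, or an offset such as +02:00"}
  ]
}
```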
I didn't think thoroughly about specifics! Mostly because I think it is impossible to do anything about this in v1 of the spec. My point is that we are going to add one more unnecessary pair to the current spec :-/
This is correct. The Hive doc also mentioned the TIMESTAMP WITH TIME ZONE that captures this behavior.
I don't agree about storing the timezone. I don't think that is something that should be done by Avro, because it is a huge can of worms. In general, to maintain engineers' sanity, it is best to normalize everything to UTC when storing the data. Having a way to store the timezone as well would require us to:
If we want this, this should be a separate proposal and would introduce new types because we can't alter the existing ones as you already mentioned.
Fair question, it would allow people who use existing
But this is done in the application code, right?
This is a bit ambiguous indeed, you need to supply an
Hello! I just wanted to add some reference materials that might be useful to understanding where we are today!
(I refer to these documents more than I care to admit...) My subjective opinion is that I'd prefer to resist adding new LogicalTypes, but this is a pretty straightforward case. I really don't see any disadvantages to extending the existing timestamp types with more precision. Any SDK that can't or won't implement them will just have the nanos
Just a quick note: I pointed out in the google doc, but Python does have support for nanos in the stdlib. |
@kojiromike thank you for the addition, I've updated the doc. If you don't have any further concerns and are in favor of the spec change, please approve the PR :) |
LGTM, thanks for the rewording. The spec is clearer with this change.
I'm very tempted to say that some of the documentation you've written in support of this PR could/should be on the website as supporting materials -- what do you think? You've put some effort into writing it up, but it could also be a good first issue for a new contributor!
Moving this forward, thanks everyone for the input 🙌
A `local-timestamp-micros` logical type annotates an Avro `long`, where the long stores the number of microseconds, from 1 January 1970 00:00:00.000000.

Example: Given an event at noon local time (12:00) on January 1, 2000, in Helsinki where the local time was two hours east of UTC (UTC+2). The timestamp is converted to Avro long 946684800000 (milliseconds) and then written.
@Fokko Is the example value 946684800000 correct? This corresponds to 2000-01-01 00:00:00 UTC.

I was expecting that the value would be 946728000000, which is 2000-01-01 12:00:00 UTC, i.e. 2000-01-01 12:00:00 +0200 converted directly to milliseconds without taking the timezone into account. But maybe I'm misinterpreting the local timestamp concept.
@Fokko Ping!
@Fokko Is the example value 946684800000 correct? This corresponds to 2000-01-01 00:00:00 UTC.
Yes, you're right. It should be 946728000000.
The concept of the local-timestamp is each value is a recording of what can be seen on a calendar and a clock hanging on the wall, for example "1969-07-20 16:17:39". It can be decomposed into year, month, day, hour, minute and seconds fields, but with no time zone information available, it does not correspond to any specific point in time. It is often used in legacy systems.
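That wall-clock reading can be sketched as follows (the UTC zone passed to `fromtimestamp` is only a device to avoid the system zone; the decomposed fields carry no time zone meaning):

```python
from datetime import datetime, timezone

def wall_clock_fields(local_timestamp_millis: int):
    """Decompose a local-timestamp-millis long into calendar/clock fields."""
    dt = datetime.fromtimestamp(local_timestamp_millis / 1000, tz=timezone.utc)
    return (dt.year, dt.month, dt.day, dt.hour, dt.minute, dt.second)

print(wall_clock_fields(946728000000))  # (2000, 1, 1, 12, 0, 0)
```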
Thanks @martin-g for pinging me. I was off that week, and the notification must be somewhere deep down in my mailbox.
AVRO-3884: Add `local-timestamp-nanos` and `timestamp-nanos` (#2554)

* AVRO-3884: Add `local-timestamp-nanos` and `timestamp-nanos`
* Add zeros (Co-authored-by: Jacob Marble <[email protected]>)
* Update precision (Co-authored-by: Jacob Marble <[email protected]>)
* Combine the datetimes and rework the wording
* Remove then
* Update doc/content/en/docs/++version++/Specification/_index.md

Co-authored-by: Jacob Marble <[email protected]>
AVRO-3884
What is the purpose of the change
Within certain industries, nanosecond timestamps are common practice (finance, for example), therefore I would propose adding this to the Avro spec as well.
Nanosecond datetime precision is needed in various fields and applications for several reasons, particularly in scenarios where extremely precise timing is critical. Here are a few reasons why nanosecond datetime precision is essential:
Financial Transactions: In the world of high-frequency trading and financial markets, nanosecond precision is necessary to accurately record and timestamp transactions. Small time differentials can result in significant advantages or losses in trading, making precise timestamps crucial for maintaining fairness and integrity in financial markets.
Scientific Research: Many scientific experiments and measurements require extremely high precision in time recording. Fields like particle physics, astronomy, and chemistry may deal with processes that occur at the nanosecond level, where the exact timing of events is vital for accurate data analysis.
Telecommunications: In telecommunications, especially for systems operating at high frequencies, nanosecond precision is required to synchronize various network components and ensure the efficient and reliable transfer of data.
GPS and Navigation: Global Positioning System (GPS) technology relies on nanosecond precision to calculate the time it takes for signals to travel from satellites to receivers. Accurate timekeeping is crucial for determining precise locations and distances.
Aerospace and Defense: In aviation and defense applications, nanosecond precision is needed for tasks such as navigation, missile guidance, and radar systems, where small timing errors can have serious consequences.
Data Centers and Distributed Systems: Modern data centers and distributed systems often require nanosecond precision for synchronization and coordination to ensure efficient and reliable operations.
Simulation and Modeling: Computer simulations and modeling in various fields, including engineering, climate science, and fluid dynamics, may need nanosecond-level precision to accurately represent real-world processes and interactions.
Network Protocols: Some network protocols and technologies, such as Ethernet, use nanosecond precision to time-stamp packets for accurate sequencing and synchronization in communication networks.
Cryptography and Security: In cryptographic applications, precise timing can be crucial for ensuring secure authentication and encryption processes, and nanosecond precision can help protect against various timing-based attacks.
High-Performance Computing: Supercomputers and other high-performance computing systems require nanosecond precision for tasks like benchmarking, profiling, and optimizing code to improve overall performance.
In these and other applications, nanosecond datetime precision is essential for ensuring accurate data recording, synchronization, and the reliable functioning of systems and processes that rely on precise timing.
Languages

- Java: nanosecond precision is available via `Instant.ofEpochSecond`.
- Rust: a nanosecond timestamp fits in an `i128` using `from_nanos`.
- Some languages store time as an `int64` on 64-bit systems, or a float on 32-bit systems. For languages like this, we could just return the integer value (depending on the physical type).

Primitive type consideration
- `int64`: Widely available, but limited precision: [1677-09-21 00:12:43.145224193, 2262-04-11 23:47:16.854775807].
- `int128`: Wide range, but might require external libraries, such as GMP for C++, or performance trade-offs because it needs `BigInteger` in Java (as opposed to primitives).
- `decimal(n, 0)`: Arbitrary range, but has the same issue as `int128`.

I'm leaning towards starting with `int64` (and maybe `int128`). We could always promote `int64 → int128` (and that's also binary compatible thanks to the zigzag encoding).

Verifying this change
(Please pick one of the following options)

- This change is a trivial rework / code cleanup without any test coverage.
- This change is already covered by existing tests, such as (please describe tests).
- This change added tests and can be verified as follows: (example:)
Documentation