Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Fix numeric overflow of timestamp nano literal #11775

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -300,8 +300,7 @@ public <T> Literal<T> to(Type type) {
case TIMESTAMP:
return (Literal<T>) new TimestampLiteral(value());
case TIMESTAMP_NANO:
// assume micros and convert to nanos to match the behavior in the timestamp case above
return new TimestampLiteral(value()).to(type);
return (Literal<T>) new TimestampNanoLiteral(value());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems correct to me, the previous behavior for this was to assume the value was in microseconds and then pass that through to TimestampLiteral but that can overflow and does not actually represent a nanosecond timestamp!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I still think this is correct but it's worth getting other's perspective on this since we are changing one of the assumptions of how value is interpreted when the type to convert to is nanoseconds.

CC @nastra @epgif @jacobmarble @rdblue thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both before and after this change is correct. In Iceberg we had the assumption that everything is in microseconds. But this doesn't hold anymore now we have nano's. I do think the version after the change is more correct and more closely aligns with my expectations. If we can make sure that folks are not using this yet, I think this change is a good one 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a chat with @rdblue who reviewed the PR that introduced this, and it is actually on purpose. Spark always passes in microseconds, changing this would break this assumption with Spark. So I think we have to revert this line. That said, I do think we need to check (and raise an error) when it overflows. Easiest way of doing this is by converting it to nano's, and convert is back to micro's and check if it still the same value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sharing the context. What about other query engines? Actually, I found this issue when I was trying to support nanosecond precision in Trino Iceberg connector. As you may know, the max precision in Trino is picos (12).

case DATE:
if ((long) Integer.MAX_VALUE < value()) {
return aboveMax();
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.DateTimeUtil;
Expand Down Expand Up @@ -107,6 +108,33 @@ public void testTimestampMicrosToDateConversion() {
assertThat(dateOrdinal).isEqualTo(-1);
}

@Test
public void testTimestampNanoWithLongLiteral() {
amogh-jahagirdar marked this conversation as resolved.
Show resolved Hide resolved
// verify round-trip between timestamp_ns and long
Literal<Long> timestampNano =
Literal.of("2017-11-16T14:31:08.000000001").to(Types.TimestampNanoType.withoutZone());
assertThat(timestampNano.value()).isEqualTo(1510842668000000001L);

Literal<Long> longLiteral =
Literal.of(1510842668000000001L).to(Types.TimestampNanoType.withoutZone());
assertThat(longLiteral).isEqualTo(timestampNano);

// cast long literal to temporal types
assertThat(longLiteral.to(Types.DateType.get()).value())
.isEqualTo((int) LocalDate.of(2017, 11, 16).toEpochDay());

assertThat(longLiteral.to(Types.TimestampType.withoutZone()).value())
.isEqualTo(1510842668000000L);

assertThat(longLiteral.to(Types.TimestampType.withZone()).value()).isEqualTo(1510842668000000L);

assertThat(longLiteral.to(Types.TimestampNanoType.withoutZone()).value())
.isEqualTo(1510842668000000001L);

assertThat(longLiteral.to(Types.TimestampNanoType.withZone()).value())
.isEqualTo(1510842668000000001L);
}

@Test
public void testTimestampNanoToTimestampConversion() {
Literal<Long> timestamp =
Expand Down Expand Up @@ -158,6 +186,33 @@ public void testTimestampNanosToDateConversion() {
assertThat(dateOrdinal).isEqualTo(-1);
}

@Test
public void testTimestampNanoWithZoneWithLongLiteral() {
// verify round-trip between timestamptz_ns and long
Literal<Long> timestampNanoWithZone =
Literal.of("2017-11-16T14:31:08.000000001+01:00").to(Types.TimestampNanoType.withZone());
assertThat(timestampNanoWithZone.value()).isEqualTo(1510839068000000001L);

Literal<Long> longLiteral =
Literal.of(1510839068000000001L).to(Types.TimestampNanoType.withZone());
assertThat(longLiteral).isEqualTo(timestampNanoWithZone);

// cast long literal to temporal types
assertThat(longLiteral.to(Types.DateType.get()).value())
.isEqualTo((int) LocalDate.of(2017, 11, 16).toEpochDay());

assertThat(longLiteral.to(Types.TimestampType.withoutZone()).value())
.isEqualTo(1510839068000000L);

assertThat(longLiteral.to(Types.TimestampType.withZone()).value()).isEqualTo(1510839068000000L);

assertThat(longLiteral.to(Types.TimestampNanoType.withoutZone()).value())
.isEqualTo(1510839068000000001L);

assertThat(longLiteral.to(Types.TimestampNanoType.withoutZone()).value())
.isEqualTo(1510839068000000001L);
}

@Test
public void testTimestampNanosWithZoneConversion() {
Literal<CharSequence> isoTimestampNanosWithZoneOffset =
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -111,9 +111,9 @@ public void testByteBufferConversions() {
assertConversion(
400000000L, TimestampNanoType.withZone(), new byte[] {0, -124, -41, 23, 0, 0, 0, 0});
assertThat(Literal.of(400000L).to(TimestampNanoType.withoutZone()).toByteBuffer().array())
.isEqualTo(new byte[] {0, -124, -41, 23, 0, 0, 0, 0});
.isEqualTo(new byte[] {-128, 26, 6, 0, 0, 0, 0, 0});
assertThat(Literal.of(400000L).to(TimestampNanoType.withZone()).toByteBuffer().array())
Comment on lines 111 to 113
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused how the original assertion was passing? Shouldn't have this always been equivalent to {-128, 26, 6, 0, 0, 0, 0, 0}?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the cause is the original logic called DateTimeUtil.microsToNanos method which multiples the value by 1000:

case TIMESTAMP_NANO:
return (Literal<T>) new TimestampNanoLiteral(DateTimeUtil.microsToNanos(value()));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see the comment on line 107/108. Could we update the assertConversion to instead test against 400000L and then remove the comment. At this point we are no longer having to pass in different values since we Literal.of(someLong).to(TimestampNanos) will always interpret someLong as nanoseconds.

.isEqualTo(new byte[] {0, -124, -41, 23, 0, 0, 0, 0});
.isEqualTo(new byte[] {-128, 26, 6, 0, 0, 0, 0, 0});

// strings are stored as UTF-8 bytes (without length)
// 'A' -> 65, 'B' -> 66, 'C' -> 67
Expand Down