-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add timestamp to date schema evolution for hive #19513
Add timestamp to date schema evolution for hive #19513
Conversation
432cf34
to
bc4b9e7
Compare
bc4b9e7
to
a4aebaa
Compare
The PR description seems off:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only skimmed but some comments
plugin/trino-hive/src/main/java/io/trino/plugin/hive/coercions/CoercionUtils.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/coercions/TimestampCoercer.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/orc/OrcTypeTranslator.java
Show resolved
Hide resolved
plugin/trino-hive/src/test/java/io/trino/plugin/hive/coercions/TestTimestampCoercer.java
Outdated
Show resolved
Hide resolved
a4aebaa
to
f22b170
Compare
@dain Addressed comments |
f22b170
to
62aa404
Compare
13b7675
to
0d4396b
Compare
plugin/trino-hive/src/test/java/io/trino/plugin/hive/coercions/TestTimestampCoercer.java
Outdated
Show resolved
Hide resolved
43b8dfe
to
0e486f4
Compare
Unrelated CI failure #19446 |
return Optional.of(new LongTimestampToVarcharCoercer(TIMESTAMP_NANOS, varcharType)); | ||
} | ||
if (toType instanceof DateType toDateType) { | ||
return Optional.of(new LongTimestampToDateCoercer(TIMESTAMP_NANOS, toDateType)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems incorrect. The from type is a timestamp type, but we are passing nanos to LongTimestampToDateCoercer
. That code will call getObject
on the type to get LongTimestamp
. The probem is that method will only work if the Block type is Fixed12Block, but if the input data is millis, the block will be a LongArrayBlock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually when Hive tries to tries to coerces a timestamp column to a varchar or date - we forcefully convert the read type to TIMESTAMP_NANOS
- Ad the underling block would be of Fixed128Block
always.
Hive doesn't have the concept of parameterized timestamp (atleast in Hive 3) so all the column would be read as TIMESTAMP(9)
so we handle for coercion accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference to the above answer:
trino/plugin/trino-hive/src/main/java/io/trino/plugin/hive/coercions/CoercionUtils.java
Lines 84 to 89 in 36895bb
public static Type createTypeFromCoercer(TypeManager typeManager, HiveType fromHiveType, HiveType toHiveType, CoercionContext coercionContext) | |
{ | |
return createCoercer(typeManager, fromHiveType, toHiveType, coercionContext) | |
.map(TypeCoercer::getFromType) | |
.orElseGet(() -> fromHiveType.getType(typeManager, coercionContext.timestampPrecision())); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested extensively manually with combinations of :
2021-01-01 00:00:00
2021-01-02 00:00:00.123
2021-01-03 00:00:00.123456
2021-01-04 00:00:00.123456789
and each time we read timestamp(9)
which ends up being stored in Fixed12Block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "we forcefully convert the read type to TIMESTAMP_NANOS" if we are going to just convert back to millis or micros (depending on the configuration of the catalog)? Seems pretty wasteful.
Add a comment explaining this behavior and maybe add a pointer to the code that changes the type to timestamp(9)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we forcefully convert the read type to TIMESTAMP_NANOS
We are reading the column with TIMESTAMP_NANOS
only if some coercion is applied on top of it - During all the other case we read it as per the configuration. We could implement another coercer for short timestamp - but now we have support reading timestamp column with NANOS precision are we going to deprecate HiveTimestampPrecision
in the future ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "we forcefully convert the read type to TIMESTAMP_NANOS" if we are going to just convert back to millis ?
Probably because we don't want to suffer from side effects of rounding 2023-12-31 23:59:59.999999999
to 2024-01-01 00:00:00.000
trino/lib/trino-orc/src/main/java/io/trino/orc/reader/TimestampColumnReader.java
Lines 399 to 406 in 5d167d4
if (nanos != 0) { | |
// adjust for bad ORC rounding logic | |
if (millis < 0) { | |
millis -= MILLIS_PER_SECOND; | |
} | |
millis += roundDiv(nanos, NANOSECONDS_PER_MILLISECOND); | |
} |
Can you point to the Hive code that handles this? I don't see a coercion for this in the normal place in Hive |
The PR description still needs an update |
0e486f4
to
607dad0
Compare
Rebased on |
plugin/trino-hive/src/main/java/io/trino/plugin/hive/coercions/CoercionUtils.java
Show resolved
Hide resolved
I'm still concerned that this is the same behavior as Hive. We can choose to do more than Hive, but that takes thought and discussion. If we can show this is what Hive does, the decission is easy |
@dain The coercion information are captured in this class - https://github.com/apache/hive/blob/branch-3.1/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java. In specific for types which can be coerced into date are mentioned here - https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java#L1111 These rules are summarized in ORC code base - Ref - https://github.com/apache/orc/blob/branch-1.6/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java
|
I still have to verify a bit how is the coercion of timestamp values is happening in Hive & Trino:
UPDATE: Manual tests are successful.Covered in the product tests as well. |
607dad0
to
ef45d35
Compare
Thank you to all the reviewers for effort put so far on this PR. AC @dain and @Praveen2112 |
ef45d35
to
2bf4f5f
Compare
@Praveen2112 🤯. I didn't know there were so may supported coercions. The link I posted I found from some issue in Hive about allowed changes with alter table change data type. Let's make sure the umbrella issue has all of these. |
@dain I'll be updating them in the umbrella task. |
@findinpath Thanks for working on this !! |
Description
Add schema evolution from
timestamp
todate
column types for hive tables.Enable the Hive users to retrieve the information from their tables after performing queries which change
timestamp
columns todate
.Hive sample DDL query for changing the type of a
timestamp
column towardsdate
:This change enables Hive users to read the column data as
date
type even if the underlying type of the data stored in the column from the storage istimestamp
.Additional context and related issues
This is a mere adaptation of #16869 to match
timestamp
todate
coercionRelease notes
( ) This is not user-visible or is docs only, and no release notes are required.
() Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: