Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timestamp to date schema evolution for hive #19513

Conversation

findinpath
Copy link
Contributor

@findinpath findinpath commented Oct 24, 2023

Description

Add schema evolution from timestamp to date column types for hive tables.

Enable the Hive users to retrieve the information from their tables after performing queries which change timestamp columns to date.

Hive sample DDL query for changing the type of a timestamp column towards date:

ALTER TABLE mytable CHANGE COLUMN mycolumn mycolumn date;

This change enables Hive users to read the column data as date type even if the underlying type of the data stored in the column from the storage is timestamp .

Additional context and related issues

This is a mere adaptation of #16869 to match timestamp to date coercion

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
() Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Add timestamp to date schema evolution for hive. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Oct 24, 2023
@findinpath findinpath self-assigned this Oct 24, 2023
@findinpath findinpath added release-notes hive Hive connector labels Oct 24, 2023
@findinpath findinpath force-pushed the findinpath/hive-timestamp-to-date-coercion branch from 432cf34 to bc4b9e7 Compare October 24, 2023 18:48
@findinpath findinpath marked this pull request as ready for review October 25, 2023 06:56
@findinpath findinpath requested a review from ebyhr October 25, 2023 12:46
@Praveen2112 Praveen2112 force-pushed the findinpath/hive-timestamp-to-date-coercion branch from bc4b9e7 to a4aebaa Compare October 30, 2023 14:13
@dain
Copy link
Member

dain commented Oct 30, 2023

The PR description seems off:

  • Shouldn't we include an example Trino statement?
  • The release notes say boolean to varchar
  • The release notes should talk about "schema evolution" as "coercion" is generic and users could interpret that as "implicit cast".

Copy link
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only skimmed but some comments

@Praveen2112 Praveen2112 force-pushed the findinpath/hive-timestamp-to-date-coercion branch from a4aebaa to f22b170 Compare November 3, 2023 11:53
@Praveen2112
Copy link
Member

@dain Addressed comments

@Praveen2112 Praveen2112 force-pushed the findinpath/hive-timestamp-to-date-coercion branch from f22b170 to 62aa404 Compare November 6, 2023 06:25
@findinpath findinpath changed the title Add timestamp to date coercion for hive Add timestamp to date schema evolution for hive Nov 6, 2023
@findinpath findinpath requested a review from dain November 6, 2023 16:07
@findinpath findinpath force-pushed the findinpath/hive-timestamp-to-date-coercion branch 3 times, most recently from 13b7675 to 0d4396b Compare November 6, 2023 18:23
@findinpath findinpath force-pushed the findinpath/hive-timestamp-to-date-coercion branch from 43b8dfe to 0e486f4 Compare November 7, 2023 08:04
@findinpath
Copy link
Contributor Author

Unrelated CI failure #19446

return Optional.of(new LongTimestampToVarcharCoercer(TIMESTAMP_NANOS, varcharType));
}
if (toType instanceof DateType toDateType) {
return Optional.of(new LongTimestampToDateCoercer(TIMESTAMP_NANOS, toDateType));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems incorrect. The from type is a timestamp type, but we are passing nanos to LongTimestampToDateCoercer. That code will call getObject on the type to get LongTimestamp. The probem is that method will only work if the Block type is Fixed12Block, but if the input data is millis, the block will be a LongArrayBlock.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually when Hive tries to tries to coerces a timestamp column to a varchar or date - we forcefully convert the read type to TIMESTAMP_NANOS - Ad the underling block would be of Fixed128Block always.

Hive doesn't have the concept of parameterized timestamp (atleast in Hive 3) so all the column would be read as TIMESTAMP(9)so we handle for coercion accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference to the above answer:

public static Type createTypeFromCoercer(TypeManager typeManager, HiveType fromHiveType, HiveType toHiveType, CoercionContext coercionContext)
{
return createCoercer(typeManager, fromHiveType, toHiveType, coercionContext)
.map(TypeCoercer::getFromType)
.orElseGet(() -> fromHiveType.getType(typeManager, coercionContext.timestampPrecision()));
}

Copy link
Contributor Author

@findinpath findinpath Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested extensively manually with combinations of :

  • 2021-01-01 00:00:00
  • 2021-01-02 00:00:00.123
  • 2021-01-03 00:00:00.123456
  • 2021-01-04 00:00:00.123456789

and each time we read timestamp(9) which ends up being stored in Fixed12Block

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "we forcefully convert the read type to TIMESTAMP_NANOS" if we are going to just convert back to millis or micros (depending on the configuration of the catalog)? Seems pretty wasteful.

Add a comment explaining this behavior and maybe add a pointer to the code that changes the type to timestamp(9)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we forcefully convert the read type to TIMESTAMP_NANOS

We are reading the column with TIMESTAMP_NANOS only if some coercion is applied on top of it - During all the other case we read it as per the configuration. We could implement another coercer for short timestamp - but now we have support reading timestamp column with NANOS precision are we going to deprecate HiveTimestampPrecision in the future ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why "we forcefully convert the read type to TIMESTAMP_NANOS" if we are going to just convert back to millis ?

Probably because we don't want to suffer from side effects of rounding 2023-12-31 23:59:59.999999999 to 2024-01-01 00:00:00.000

if (nanos != 0) {
// adjust for bad ORC rounding logic
if (millis < 0) {
millis -= MILLIS_PER_SECOND;
}
millis += roundDiv(nanos, NANOSECONDS_PER_MILLISECOND);
}

@dain
Copy link
Member

dain commented Nov 8, 2023

Can you point to the Hive code that handles this? I don't see a coercion for this in the normal place in Hive

@dain
Copy link
Member

dain commented Nov 8, 2023

The PR description still needs an update

@findinpath findinpath force-pushed the findinpath/hive-timestamp-to-date-coercion branch from 0e486f4 to 607dad0 Compare November 8, 2023 20:18
@findinpath
Copy link
Contributor Author

Rebased on master to address the conflicts.

@dain
Copy link
Member

dain commented Nov 9, 2023

Can you point to the Hive code that handles this? I don't see a coercion for this in the normal place in Hive

I'm still concerned that this is the same behavior as Hive. We can choose to do more than Hive, but that takes thought and discussion. If we can show this is what Hive does, the decission is easy

@Praveen2112
Copy link
Member

@dain The coercion information are captured in this class - https://github.com/apache/hive/blob/branch-3.1/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java.

In specific for types which can be coerced into date are mentioned here - https://github.com/apache/hive/blob/af7059e2bdc8b18af42e0b7f7163b923a0bfd424/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java#L1111

These rules are summarized in ORC code base - Ref - https://github.com/apache/orc/blob/branch-1.6/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java

  /**
   * (Rules from Hive's PrimitiveObjectInspectorUtils conversion)
   *
   * To BOOLEAN, BYTE, SHORT, INT, LONG:
   *   Convert from (BOOLEAN, BYTE, SHORT, INT, LONG) with down cast if necessary.
   *   Convert from (FLOAT, DOUBLE) using type cast to long and down cast if necessary.
   *   Convert from DECIMAL from longValue and down cast if necessary.
   *   Convert from STRING using LazyLong.parseLong and down cast if necessary.
   *   Convert from (CHAR, VARCHAR) from Integer.parseLong and down cast if necessary.
   *   Convert from TIMESTAMP using timestamp getSeconds and down cast if necessary.
   *
   *   AnyIntegerFromAnyIntegerTreeReader (written)
   *   AnyIntegerFromFloatTreeReader (written)
   *   AnyIntegerFromDoubleTreeReader (written)
   *   AnyIntegerFromDecimalTreeReader (written)
   *   AnyIntegerFromStringGroupTreeReader (written)
   *   AnyIntegerFromTimestampTreeReader (written)
   *
   * To FLOAT/DOUBLE:
   *   Convert from (BOOLEAN, BYTE, SHORT, INT, LONG) using cast
   *   Convert from FLOAT using cast
   *   Convert from DECIMAL using getDouble
   *   Convert from (STRING, CHAR, VARCHAR) using Double.parseDouble
   *   Convert from TIMESTAMP using timestamp getDouble
   *
   *   FloatFromAnyIntegerTreeReader (existing)
   *   FloatFromDoubleTreeReader (written)
   *   FloatFromDecimalTreeReader (written)
   *   FloatFromStringGroupTreeReader (written)
   *
   *   DoubleFromAnyIntegerTreeReader (existing)
   *   DoubleFromFloatTreeReader (existing)
   *   DoubleFromDecimalTreeReader (written)
   *   DoubleFromStringGroupTreeReader (written)
   *
   * To DECIMAL:
   *   Convert from (BOOLEAN, BYTE, SHORT, INT, LONG) using to HiveDecimal.create()
   *   Convert from (FLOAT, DOUBLE) using to HiveDecimal.create(string value)
   *   Convert from (STRING, CHAR, VARCHAR) using HiveDecimal.create(string value)
   *   Convert from TIMESTAMP using HiveDecimal.create(string value of timestamp getDouble)
   *
   *   DecimalFromAnyIntegerTreeReader (existing)
   *   DecimalFromFloatTreeReader (existing)
   *   DecimalFromDoubleTreeReader (existing)
   *   DecimalFromStringGroupTreeReader (written)
   *
   * To STRING, CHAR, VARCHAR:
   *   Convert from (BYTE, SHORT, INT, LONG) using to string conversion
   *   Convert from BOOLEAN using boolean (True/False) conversion
   *   Convert from (FLOAT, DOUBLE) using to string conversion
   *   Convert from DECIMAL using HiveDecimal.toString
   *   Convert from CHAR by stripping pads
   *   Convert from VARCHAR with value
   *   Convert from TIMESTAMP using Timestamp.toString
   *   Convert from DATE using Date.toString
   *   Convert from BINARY using Text.decode
   *
   *   StringGroupFromAnyIntegerTreeReader (written)
   *   StringGroupFromBooleanTreeReader (written)
   *   StringGroupFromFloatTreeReader (written)
   *   StringGroupFromDoubleTreeReader (written)
   *   StringGroupFromDecimalTreeReader (written)
   *
   *   String from Char/Varchar conversion
   *   Char from String/Varchar conversion
   *   Varchar from String/Char conversion
   *
   *   StringGroupFromTimestampTreeReader (written)
   *   StringGroupFromDateTreeReader (written)
   *   StringGroupFromBinaryTreeReader *****
   *
   * To TIMESTAMP:
   *   Convert from (BOOLEAN, BYTE, SHORT, INT, LONG) using TimestampWritable.longToTimestamp
   *   Convert from (FLOAT, DOUBLE) using TimestampWritable.doubleToTimestamp
   *   Convert from DECIMAL using TimestampWritable.decimalToTimestamp
   *   Convert from (STRING, CHAR, VARCHAR) using string conversion
   *   Or, from DATE
   *
   *   TimestampFromAnyIntegerTreeReader (written)
   *   TimestampFromFloatTreeReader (written)
   *   TimestampFromDoubleTreeReader (written)
   *   TimestampFromDecimalTreeReader (written)
   *   TimestampFromStringGroupTreeReader (written)
   *   TimestampFromDateTreeReader
   *
   *
   * To DATE:
   *   Convert from (STRING, CHAR, VARCHAR) using string conversion.
   *   Or, from TIMESTAMP.
   *
   *  DateFromStringGroupTreeReader (written)
   *  DateFromTimestampTreeReader (written)
   *
   * To BINARY:
   *   Convert from (STRING, CHAR, VARCHAR) using getBinaryFromText
   *
   *  BinaryFromStringGroupTreeReader (written)
   *
   * (Notes from StructConverter)
   *
   * To STRUCT:
   *   Input must be data type STRUCT
   *   minFields = Math.min(numSourceFields, numTargetFields)
   *   Convert those fields
   *   Extra targetFields to NULL
   *
   * (Notes from ListConverter)
   *
   * To LIST:
   *   Input must be data type LIST
   *   Convert elements
   *
   * (Notes from MapConverter)
   *
   * To MAP:
   *   Input must be data type MAP
   *   Convert keys and values
   *
   * (Notes from UnionConverter)
   *
   * To UNION:
   *   Input must be data type UNION
   *   Convert value for tag
   */ 

@findinpath
Copy link
Contributor Author

findinpath commented Nov 9, 2023

I still have to verify a bit how is the coercion of timestamp values is happening in Hive & Trino:

  • 2022-12-31 23:59:59.999
  • 2022-12-31 23:59:59.999999
  • 2022-12-31 23:59:59.999999999

UPDATE: Manual tests are successful.Covered in the product tests as well.

@findinpath findinpath force-pushed the findinpath/hive-timestamp-to-date-coercion branch from 607dad0 to ef45d35 Compare November 9, 2023 19:11
@findinpath
Copy link
Contributor Author

Thank you to all the reviewers for effort put so far on this PR.
This was a very informative process for me.

AC @dain and @Praveen2112

@findinpath findinpath force-pushed the findinpath/hive-timestamp-to-date-coercion branch from ef45d35 to 2bf4f5f Compare November 9, 2023 22:06
@findinpath findinpath requested a review from dain November 10, 2023 08:07
@dain
Copy link
Member

dain commented Nov 10, 2023

@Praveen2112 🤯. I didn't know there were so may supported coercions. The link I posted I found from some issue in Hive about allowed changes with alter table change data type. Let's make sure the umbrella issue has all of these.

@Praveen2112
Copy link
Member

@dain I'll be updating them in the umbrella task.

@Praveen2112 Praveen2112 merged commit 0e4c325 into trinodb:master Nov 11, 2023
58 checks passed
@Praveen2112
Copy link
Member

@findinpath Thanks for working on this !!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants