Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add $partitions table to Delta Connector #19722

Conversation

whitleykeith
Copy link
Contributor

@whitleykeith whitleykeith commented Nov 13, 2023

Description

This PR implements the $partitions system table for the Delta Lake connector, and behaves similarly to the Hive analog. It additionally adds a .devcontainer.json to support running Trino in a devcontainer/codespace. I did this primarily for my own workflow and can remove it if it's not desired.

An implementation wrinkle: Due to how the AddFileEntry.getCanonicalPartitionValues returns strings for partition values and some issues with casting back to their original types, the columns in the result table are always VARCHAR, even if the underlying column is a different type. I called this out in the related doc updates.

Additional context and related issues

Closes #18590

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Delta Lake
* Add `$partitions` system table to the Delta Connector ({issue}`18590`)

Copy link

cla-bot bot commented Nov 13, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added docs delta-lake Delta Lake connector labels Nov 13, 2023
Copy link

cla-bot bot commented Nov 13, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@findinpath findinpath self-requested a review November 13, 2023 17:26
this.partitionColumns = requireNonNull(partitionColumns, "partitionColumns is null");

ImmutableList.Builder<ColumnMetadata> columns = ImmutableList.builder();
partitionColumns.forEach(column -> columns.add(new ColumnMetadata(column.getBaseColumnName(), VARCHAR)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't follow the reasoning behind this constraint - why not using the actual type of the partition column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially tried to do that, but there were issues with casting back to the original type as AddFileEntry.getCanonicalPartitionValues() returns a string represenation of the value

Copy link
Member

@ebyhr ebyhr Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be the original type. Also, we need to discuss the result format (Hive or Iceberg)

-- Hive
select * from "test$partitions";
 b
---
 1

-- Iceberg
select * from "test$partitions";
 partition | record_count | file_count | total_size |                       data
-----------+--------------+------------+------------+--------------------------------------------------
 {b=1}     |            1 |          1 |        317 | {a={min=1, max=1, null_count=0, nan_count=NULL}}

Copy link
Contributor Author

@whitleykeith whitleykeith Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to see if I can get the types correct 👍

Good point on format, I've only used the hive and delta connectors in practice so went with the original Hive format. The Iceberg format has a lot more info, but I'm not sure how best to materialize that from the transaction log. I'll defer to one of you on which format we want

Thanks for the review as well, I'll take a look at the other comments and fix this PR up. Appreciate the patience for a first time contributor 🙏

This commit adds the `DeltaLakePartitionsTable` and `DeltaLakePartitionsTableProvider`
classes to support the `$partitions` system table in the Delta Connector.

As the partition values casted as string from internal calls, the resulting table
will always return VARCHAR, so called that out in the docs.
@whitleykeith whitleykeith force-pushed the whitleykeith/delta-partitions-table branch from 601089c to 66626ae Compare November 13, 2023 17:47
Copy link

cla-bot bot commented Nov 13, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

Copy link

cla-bot bot commented Nov 13, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@@ -0,0 +1,18 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove an unrelated change.

@@ -704,6 +704,49 @@ delta.feature.columnMapping | supported |

(delta-lake-special-columns)=

#### `$partitions` table

The `$partitions` table provides access to a specified Delta Lake Table's partitions keys and values. Each table row is a unique partition value, with the columns being the partition keys.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line is too long. Please wrap at 80 characters.

Comment on lines +726 to +730
name | comment | regionkey |
----------------------------+-----------------+---------------------+
foo | foo | east |
bar | bar | west |
baz | baz | west |
Copy link
Member

@ebyhr ebyhr Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This table looks too wide. Please narrow it. Same for the following $partitions result.

}

private Optional<SystemTable> getRawSystemTable(SchemaTableName systemTableName)
private Optional<SystemTable> getRawSystemTable(SchemaTableName systemTableName, ConnectorSession session)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace the order between SchemaTableName & ConnectorSession.

Comment on lines +113 to +114
assertUpdate("INSERT INTO " + tableName + " VALUES (0, CAST('2019-09-08' AS DATE)), (1, CAST('2019-09-09' AS DATE)), (2, CAST('2019-09-09' AS DATE))", 3);
assertUpdate("INSERT INTO " + tableName + " VALUES (3, CAST('2019-09-09' AS DATE)), (4, CAST('2019-09-10' AS DATE)), (5, CAST('2019-09-10' AS DATE))", 3);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please test with NULL partitions.

@@ -0,0 +1,18 @@
{
Copy link
Member

@ebyhr ebyhr Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase on master instead of merge and then fix logical conflicts.

Error:  Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.11.0:compile (default-compile) on project trino-delta-lake: Compilation failure: Compilation failure: 
Error:  /home/runner/work/trino/trino/plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakePartitionsTable.java:[95,49] cannot find symbol
Error:    symbol:   method getSnapshot(io.trino.spi.connector.ConnectorSession,io.trino.spi.connector.SchemaTableName,java.lang.String,java.util.Optional<java.lang.Object>)
Error:    location: variable transactionLogAccess of type io.trino.plugin.deltalake.transactionlog.TransactionLogAccess
Error:  /home/runner/work/trino/trino/plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakePartitionsTableProvider.java:[57,63] cannot find symbol
Error:    symbol:   method getSnapshot(io.trino.spi.connector.ConnectorSession,io.trino.spi.connector.SchemaTableName,java.lang.String,java.util.Optional<java.lang.Object>)
Error:    location: variable transactionLogAccess of type io.trino.plugin.deltalake.transactionlog.TransactionLogAccess
Error:  -> [Help 1]

@Test
public void testPartitionsTable()
{
String tableName = "test_simple_partitions_table";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 3 tests have _simple_ in the table name. It doesn't provide any benefits unless we have complex tests. I would remove them.

Comment on lines +172 to +177
assertUpdate("INSERT INTO " + tableName + " VALUES (3, CAST('2019-09-09' AS DATE)), (4, CAST('2019-09-10' AS DATE)), (5, CAST('2019-09-10' AS DATE))", 3);
assertUpdate("UPDATE " + tableName + " SET _bigint = 50 WHERE _bigint = BIGINT '5'", 1);
assertUpdate("DELETE FROM " + tableName + " WHERE _date = DATE '2019-09-08'", 1);
assertQuerySucceeds("ALTER TABLE " + tableName + " EXECUTE OPTIMIZE");
assertQuery("SELECT count(*) FROM " + tableName, "VALUES 5");
assertQuery("SELECT count(*) FROM \"" + tableName + "$partitions\"", "VALUES 0");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove these statements. Single INSERT is sufficient for verifying $partitons with unpartitioned table.

@@ -704,6 +704,49 @@ delta.feature.columnMapping | supported |

(delta-lake-special-columns)=

#### `$partitions` table
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current place (after (delta-lake-special-columns)=) is wrong.

Comment on lines +712 to +713


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove a redundant empty line.

Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jan 10, 2024
@mosabua
Copy link
Member

mosabua commented Jan 11, 2024

👋 @whitleykeith - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

@github-actions github-actions bot removed the stale label Jan 12, 2024
Copy link

github-actions bot commented Feb 2, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Feb 2, 2024
@mosabua
Copy link
Member

mosabua commented Feb 2, 2024

@findinpath @ebyhr .. is this still useful/valid or should we close?

@github-actions github-actions bot removed the stale label Feb 5, 2024
@whitleykeith
Copy link
Contributor Author

@mosabua I'm the original author of the PR. Sorry for not chiming in here, I chatted with @findinpath last month in slack about this. We use Trino a lot, especially the Delta connector, and I would love to be able to get this in a good state. I ran out of time when I first started development on this but I'm hoping I can block off some time in the next few weeks to resolve comments + make necessary changes

@mosabua
Copy link
Member

mosabua commented Feb 14, 2024

Thats great news @whitleykeith .. feel free to continue to work with @findinpath and others and let me know if you need any other help.

@whitleykeith
Copy link
Contributor Author

It was easier to make a brand new PR since there was a lot of changes (including a java version upgrade), new PR is here: #20735

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
delta-lake Delta Lake connector docs
Development

Successfully merging this pull request may close these issues.

Create $partitions system table for Delta Tables
4 participants