Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for creating and dropping tables using Iceberg object store #20555

Merged
merged 1 commit into from
Nov 22, 2024

Conversation

jakelong95
Copy link
Contributor

@jakelong95 jakelong95 commented Feb 2, 2024

Description

Currently, Trino is only able to write to and read from tables that use Iceberg's ObjectStorageLocationProvider, but is unable to create or drop tables using the location provider.

This PR enables Trino to create tables using Iceberg's object storage by adding the following properties:

  • iceberg.object-store.enabled - Corresponds with Spark's write.object-storage.enabled, which enables use of Iceberg's ObjectStorageLocationProvider
  • iceberg.data-location - Corresponds to Spark's write.data.path, which sets where Iceberg data files will be written

Enabling the object store property and setting the data path will cause Iceberg to provide data file locations prefixed by a deterministic hash generated from the file name in the location specified by write.data.path, which will help reduce throttling from cloud storage systems like S3 by evenly distributing files across multiple prefixes. Iceberg's own documentation on this feature has more information.

For example, without the object store enabled you would get the following locations for data files, all under the same prefix iceberg-tables/myschema/mytable/data:

  • s3://mybucket/iceberg-tables/myschema/mytable/data/file1.parquet
  • s3://mybucket/iceberg-tables/myschema/mytable/data/file2.parquet
  • s3://mybucket/iceberg-tables/myschema/mytable/data/file3.parquet

But, if you enable the object store, you would get the following locations, each in their own prefix:

  • s3://mybucket/iceberg-tables/myschema/mytable/data/<file1 hash>/file1.parquet
  • s3://mybucket/iceberg-tables/myschema/mytable/data/<file2 hash>/file2.parquet
  • s3://mybucket/iceberg-tables/myschema/mytable/data/<file3 hash>/file3.parquet

Additional context and related issues

This PR maintains compatibility with Spark by using the table properties write.object-storage.enabled and write.data.path, which had previously been set up to allow Trino to write to Iceberg tables using the ObjectStorageLocationProvider in #8573

Fixes #8861

Release notes

(x) Release notes are required, with the following suggested text:

# Iceberg
* Add support for [object store file layout](https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout). ({issue}`8861`)

Copy link

cla-bot bot commented Feb 2, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@jakelong95
Copy link
Contributor Author

Re-created from #20516

@github-actions github-actions bot added tests:hive iceberg Iceberg connector labels Feb 2, 2024
@jakelong95 jakelong95 force-pushed the jl-iceberg-object-storage branch from cda8e04 to 4f83e8a Compare April 17, 2024 18:28
@cla-bot cla-bot bot added the cla-signed label Apr 17, 2024
@jakelong95 jakelong95 marked this pull request as ready for review April 17, 2024 20:38
@mosabua mosabua requested a review from electrum April 17, 2024 20:55
@findinpath findinpath requested a review from homar April 26, 2024 10:07
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label May 17, 2024
@mosabua
Copy link
Member

mosabua commented May 29, 2024

Any thoughts @amogh-jahagirdar

@github-actions github-actions bot removed the stale label May 30, 2024
Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

Copy link

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Jul 15, 2024
@jakelong95 jakelong95 reopened this Jul 15, 2024
@mosabua mosabua added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels Jul 15, 2024
@mosabua
Copy link
Member

mosabua commented Jul 15, 2024

I added the stale-ignore label so the PR stays open. Please rebase as a next step.

@jakelong95 jakelong95 force-pushed the jl-iceberg-object-storage branch 3 times, most recently from 6ed669a to a7454d1 Compare July 15, 2024 19:19
@jakelong95
Copy link
Contributor Author

Hi @mosabua, pushed the rebase in. Looks like the failed check timed out but I'm not able to restart it.
I'm also not sure what the process is for bumping Trino PRs, should I re-post in the Trino slack?

@mosabua
Copy link
Member

mosabua commented Jul 29, 2024

Please ask for help on the #iceberg or the #core-dev channel

@jakelong95 jakelong95 force-pushed the jl-iceberg-object-storage branch 2 times, most recently from ed46a94 to e452e6d Compare August 20, 2024 23:34
// TODO: support path override in Iceberg table creation: https://github.com/trinodb/trino/issues/8861
if (table.properties().containsKey(OBJECT_STORE_PATH) ||
table.properties().containsKey("write.folder-storage.path") || // Removed from Iceberg as of 0.14.0, but preserved for backward compatibility
table.properties().containsKey(WRITE_METADATA_LOCATION) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was WRITE_METADATA_LOCATION removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I totally follow. I'm not removing the property, just its usage here since Trino will be able to drop tables created with Iceberg's object storage file layout with this change

@grantatspothero
Copy link
Contributor

grantatspothero commented Nov 4, 2024

Adding support for updating this table property would be useful, see IcebergMetadata#setTableProperties

It is a common to create a table and only later encounter object storage rate limiting. Not allowing updates would make this workflow to fix rate limiting a pain.

Not sure if updates should go in this PR or another one @ebyhr but heads up.

@ebyhr ebyhr force-pushed the jl-iceberg-object-storage branch 3 times, most recently from 38fe6eb to 78224d6 Compare November 7, 2024 05:10
@jakelong95
Copy link
Contributor Author

Adding support for updating this table property would be useful, see IcebergMetadata#setTableProperties

It is a common to create a table and only later encounter object storage rate limiting. Not allowing updates would make this workflow to fix rate limiting a pain.

Not sure if updates should go in this PR or another one @ebyhr but heads up.

That's a good callout. I'll add it in this PR

@ebyhr
Copy link
Member

ebyhr commented Nov 7, 2024

That's a good callout. I'll add it in this PR

@jakelong95 I already push the code to support SET PROPERTIES.

@jakelong95
Copy link
Contributor Author

That's a good callout. I'll add it in this PR

@jakelong95 I already push the code to support SET PROPERTIES.

Yeah, saw that you added adding in object_store_enabled. I'm adding in support for modifying data_location since the two properties are related

@jakelong95 jakelong95 force-pushed the jl-iceberg-object-storage branch from af9410b to 510f032 Compare November 8, 2024 23:07
@jakelong95
Copy link
Contributor Author

@ebyhr Just pushed up the following changes:

  1. Ability to update data_location for a table after creation, and the associated documentation
  2. Test cases for adding object storage or a data location after the table was created
  3. Assertions for Spark <-> Trino compatibility with the object_store_enabled and data_location properties
  4. Improved assertions in some of the tests in my original commit to better validate the data file location matches the expected format.
    • For example, one of the assertions matches on the regex local:///table-location/xyz/.{6}/tpch/test_create_and_drop.* - After the data location local:///table-location/xyz it has a regex match on 6 characters, which is what Iceberg uses for the deterministic hashes based on the file name. Just wanted to be extra sure :)

Let me know if you have any more questions or concerns!

@jakelong95 jakelong95 requested a review from ebyhr November 12, 2024 18:06
@jakelong95 jakelong95 force-pushed the jl-iceberg-object-storage branch from c0690b8 to 5bab5a9 Compare November 13, 2024 19:28
@jakelong95 jakelong95 requested a review from ebyhr November 13, 2024 19:28
@ebyhr ebyhr force-pushed the jl-iceberg-object-storage branch from 0a268a3 to 7f55756 Compare November 16, 2024 06:46
@ebyhr ebyhr changed the title Iceberg: Add support for creating and dropping tables using Iceberg object store Add support for creating and dropping tables using Iceberg object store Nov 16, 2024
@jakelong95 jakelong95 force-pushed the jl-iceberg-object-storage branch 2 times, most recently from 0c7e918 to 98c26d0 Compare November 18, 2024 21:20
@jakelong95
Copy link
Contributor Author

@ebyhr fixed the failing unit test that I had added.

A new failure from one of the existing unit tests TestIcebergV2.testOptimizeDuringWriteOperations came up though. I'm unable to replicate this failure on my machine, and it doesn't look like any of my changes would have affected it.

@ebyhr ebyhr force-pushed the jl-iceberg-object-storage branch 3 times, most recently from 1b827b6 to baafef2 Compare November 20, 2024 03:47
@ebyhr ebyhr force-pushed the jl-iceberg-object-storage branch from baafef2 to 9f42c18 Compare November 20, 2024 07:19
@ebyhr ebyhr merged commit cd1965d into trinodb:master Nov 22, 2024
51 checks passed
@github-actions github-actions bot added this to the 466 milestone Nov 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed docs iceberg Iceberg connector stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.
Development

Successfully merging this pull request may close these issues.

Support path override in Iceberg table creation
7 participants