-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark 3.5: Fix Migrate procedure renaming issue for custom catalog #8931
Closed
+77
−2
Closed
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: earlier we use to use sourceCatalog as destCatalog too, was this a problem ? can you please add more comments as to why sourceCatalog was picked as
spark_catalog
rather than theglue_catalog
?? since we were calling the migrate procedure from glue_catalog ?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, @singhpk234.
Yes that was a problem because in the
migrate
, the source catalog can only beSparkSessionCatalog
. Let me elaborate this in addition to what I investigated in #7317 (comment) (if I'm wrong, please correct me).When running
migrate
,checkSourceCatalog
inBaseTableCreationSparkAction
is called along withMigrateTableSparkAction
initialization.The
checkSourceCatalog
comes from theMigrateTableSparkAction
implementation, and the method checks if the sourceCatalog is SparkSessionCatalog or not.So, the sourceCatalog is necessary to be specified the
SparkSessionCatalog
. And ifSparkSessionCatalog
is set andGlueCatalogImpl
is set as its implementation, therenameTable
operation in themigrate
is called viaSparkSessionCatalog
,GlueCatalogImpl
is not used in therenameTable
operation, and then themigrate
query fails.For the custom catalog, specifically Glue Data Catalog, as you know, it originally has the
renameTable
implemenation in Iceberg because the Glue Data Catalog client doesn't support the table rename. I think if it were possible to specify the destCatalog to use the destCatalog impl, it could resolve the restriction of source catalog side impl like Glue Data Catalog.For reference, I tested the following patterns, but all patterns failed:
Pattern 1 - Set SparkCatalog (
glue_catalog
) to the sourceTableSparkSession config:
Migrate query:
Error (partial stacktrace):
Pattern 1' - Set SparkCatalog (
glue_catalog
) to the sourceTable and specify SparkCatalog for Procedure callThe same result was obtained as "Pattern 1" because the catalog name for procedure call doesn't affect the source/destination catalog.
SparkSession config: omitted. The same configuration was used as "Pattern 1".
Migrate query:
Error: omitted. The same error was obtained as "Pattern 1".
Pattern 2 - Set
SparkSessionCatalog
with GlueCatalogImpl to the sourceTableSparkSession config:
Migrate query:
Error (partial stacktrace):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tomtongue do we know why would the
icebergCatalog.tableExists(from)
fail ? so that we are not using the GlueCatalog implementation (of iceberg) which supports rename and are falling back to the V2SessionCatalog.renameTable which is using GlueMetaStoreClient and doesn't support rename ?iceberg/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java
Lines 293 to 305 in 445664f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singhpk234 thanks for checking my comments.
In this part, the
renameTable
checks if the source table is Iceberg or not. However, for themigrate
procedure, the source table must bespark_catalog
(for example, if theglue_catalog
is specified, it fails as described above).So in the
migrate
query,getSessionCatalog().renameTable(from, to);
is called, and the answer for your question below is NOT using the GlueCatalogImpl in Iceberg.As you're mentioning, the
V2SessionCatalog.rename
falls back to GlueMetastoreClient that doesn't support table rename. And the query always fails.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the source table being non-iceberg table is making the check of rename fail and it defers to V2SessionCatalog.rename, and this step can't be skipped as we want to rename the source to backup table to halt the writes. wondering then adding the support for rename in GlueMetaStore client would be best, but it comes with it's own challenges frankly, how we have implemented rename in GlueCatalog of iceberg is a bit tricky we read the source table, create a destination table and then delete the source table (As glue doesn't have rename API). I am not sure if something goes south will we be able to correctly restore the source table state from the backup.
for ex, i am assuming non-iceberg table would track partition info also in glue which i don't think is correctly being propagated in iceberg GlueCatalog impl, thoughts :
iceberg/aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java
Line 383 in 2c89010
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review, @singhpk234. Yes, as you're saying, the Iceberg GlueCatalogImpl replicates the "partial" metadata in the rename. So if the source Spark/Hive table is partitioned, the restore process will fail as follows:
This error was caused by the partition lost in the renamed table.
So as you know, the way to resolve the migrate restriction, supporting the rename for GlueHiveMetastoreClient should be the best.
At least there are people who have tried to migrate from their table into Iceberg on custom catalog like Glue Catalog. But the migrate query cannot be used because of the rename restriction. So let me consider the better way to resolve this issue. If there's no way to resolve this issue, I think we need to ask the GlueHiveMetastoreClient to support the rename operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, at this point GlueHiveMetastoreClient supporting rename ops seems a reasonable solution to me.