
Spark 3.5: Implement RewriteTablePath #11555

Merged
merged 8 commits into apache:main on Jan 8, 2025

Conversation

szehon-ho
Collaborator

This is the implementation for #10920 (an action to prepare metadata for an Iceberg table for DR copy)

This has been used in production for a while in our setup, although support for rewriting V2 position deletes is new. I performed the following cleanups while contributing it.

  • Made RewriteTableSparkAction code more functional (avoiding member variables on the action to track state)
  • Moved some RewriteTableSparkAction code to core Util classes to avoid having to make some classes public, as was previously done.

@szehon-ho szehon-ho force-pushed the rewrite_table_path branch 2 times, most recently from 4e18bc3 to be5d2f5 Compare November 16, 2024 09:53
* Path to a comma-separated list of source and target paths for all files added to the table
* between startVersion and endVersion, including original data files and metadata files
* rewritten to staging.
* Result file list location. This file contains a 'copy plan', a comma-separated list of all
Member

This still feels a little ambiguous.

Maybe?

A file containing a listing of both original file names and file names under the new prefix, comma separated.

Contributor

we could give an example of the format to make it clear like this:

sourcepath/datafile1.parquet targetpath/datafile1.parquet,
sourcepath/datafile2.parquet targetpath/datafile2.parquet,
...
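
A hypothetical sketch of consuming a copy plan in that format: each comma-separated entry holds a source path and a target path separated by whitespace. The class and method names here (`CopyPlanSketch`, `parseCopyPlan`) are illustrative only, not part of the action's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CopyPlanSketch {
  // Parses "source target,source target,..." into an ordered source-to-target map.
  static Map<String, String> parseCopyPlan(String plan) {
    Map<String, String> sourceToTarget = new LinkedHashMap<>();
    for (String entry : plan.split(",")) {
      String trimmed = entry.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      String[] pair = trimmed.split("\\s+");
      // pair[0] = original file path, pair[1] = file path under the new prefix
      sourceToTarget.put(pair[0], pair[1]);
    }
    return sourceToTarget;
  }

  public static void main(String[] args) {
    String plan =
        "sourcepath/datafile1.parquet targetpath/datafile1.parquet,"
            + "sourcepath/datafile2.parquet targetpath/datafile2.parquet";
    Map<String, String> copies = parseCopyPlan(plan);
    System.out.println(copies.size()); // 2
  }
}
```

A DR copy tool could iterate this map and issue one copy per entry, which is why a flat, order-preserving listing is a convenient output format.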

Collaborator Author

Put the example in the suggestion and reworded.

Contributor

@flyrain left a comment

Thanks @szehon-ho for working on it. Left some comments. The major concern is the perf in case of multiple delete files in a manifest file.


Comment on lines 63 to 64
metadata.statisticsFiles(),
metadata.partitionStatisticsFiles(),
Contributor

We will need to rewrite statistic file path as well, but I'm OK to support it in a follow-up PR.

@szehon-ho szehon-ho force-pushed the rewrite_table_path branch 3 times, most recently from 9bfd8d0 to 6880510 Compare December 7, 2024 01:49
@szehon-ho
Copy link
Collaborator Author

szehon-ho commented Dec 7, 2024

@flyrain thanks for review! I spent some time cleaning it up. The comments should be addressed, let me know if I missed any.

  1. More things are moved to RewriteTablePathsUtil, removing the need to make ManifestLists public.
  2. The rewrite of position deletes is now a Spark job for better performance. It took me a bit of time to realize that I need to rewrite the delete entry's bounds as well in the rewrite-manifest job (before, because I rewrote the position delete file inline, the bounds were correctly populated; now I have to fix them because the delete file is rewritten separately).

* @param targetPrefix target prefix which will replace it
* @return metrics for the new delete file entry
*/
public static Metrics replacePathBounds(
Collaborator Author

This new logic rewrites the position delete entry's file_path bounds, which are used in the delete index. Now that we moved the position delete rewrite to a separate Spark job instead of doing it inline with writing the delete entry, we need to fix the bounds for the metadata here.
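
A minimal sketch of the idea (not the actual core implementation): when the lower and upper file_path bounds are the same single path, the source prefix is swapped for the target prefix; the case of differing bounds is discussed further below in this thread.

```java
public class PathBoundsSketch {
  // Rewrites a single-path file_path bound from the source prefix to the
  // target prefix. Returns null when the bounds differ, signalling that the
  // caller should drop the path bounds rather than rewrite them.
  static String replacePathBound(
      String lower, String upper, String sourcePrefix, String targetPrefix) {
    if (!lower.equals(upper)) {
      return null; // differing bounds: not rewritten in this sketch
    }
    if (lower.startsWith(sourcePrefix)) {
      return targetPrefix + lower.substring(sourcePrefix.length());
    }
    return lower; // path is outside the source prefix; leave it untouched
  }

  public static void main(String[] args) {
    String rewritten =
        replacePathBound(
            "s3://old/db/data/f1.parquet",
            "s3://old/db/data/f1.parquet",
            "s3://old",
            "s3://new");
    System.out.println(rewritten); // s3://new/db/data/f1.parquet
  }
}
```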

Contributor

@flyrain left a comment

Thanks @szehon-ho for working on it. LGTM overall. Left questions and minor comments.

return metricsWithoutPathBounds(deleteFile);
}

if (lowerPathBound.equals(upperPathBound)) {
Contributor

I might be missing the context. Do we have to handle cases where the lower path bound doesn't equal the upper path bound? Or are they always the same for the file path in a position delete file? How does filtering work in that case?

Collaborator Author

@szehon-ho Jan 4, 2025

Yes, if I understand it correctly, the existing logic doesn't do filtering unless the bounds are equal; see just above this method:
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/util/ContentFileUtil.java#L81

I believe the reasoning is that it's not even worth it, as paths are effectively random. So for simplicity I just remove these metrics if the two bounds are not the same; what do you think? Or let me know if you would prefer I just change both paths.

Collaborator Author

@szehon-ho Jan 4, 2025

More context: I think having the same upper/lower bound is related to file-scoped delete files, which it seems will become the default soon (ref: https://lists.apache.org/thread/2mpmdp4ns66gp595c9b3clstgskbslsp), hence my thought that it's not worth copying the metrics for the other case (which doesn't even attempt filtering).

Contributor

I think it's worth having a comment to clarify the behavior. It isn't a blocker to me, though.

Collaborator Author

Done

Contributor

@dramaticlly left a comment

Thank you @szehon-ho for your great work! I left some nitpicks, but overall LGTM.

@szehon-ho
Collaborator Author

Rebased and addressed review comments, thanks a lot @flyrain and @dramaticlly for reviewing this big change.

}
}

public static class RewriteContentFileResult extends RewriteResult<ContentFile<?>> {
Collaborator Author

@szehon-ho Jan 5, 2025

Note: this extra subclass is added because the Spark encoder didn't like a class (RewriteResult) with a type parameter T, hence the need for a class with a concrete type parameter.

It also then needed some extra methods to be able to cleanly aggregate RewriteResults of different content file types, now that the change in #11555 (comment) makes the data file entry logic also return a RewriteResult.
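
A hedged sketch of the pattern described above: encoders that infer a schema by reflection cannot handle an open type parameter, so a thin subclass pins the parameter to a concrete type. The class names echo the PR, but the bodies are illustrative only (String stands in for a concrete content file type).

```java
import java.util.ArrayList;
import java.util.List;

public class EncoderPatternSketch {
  // Generic result: a schema cannot be inferred for the open parameter T.
  static class RewriteResult<T> {
    final List<T> toRewrite = new ArrayList<>();
  }

  // Concrete type parameter, so reflection sees a fully closed type.
  static class ContentFileResult extends RewriteResult<String> {
    // Extra method so results from different rewrite steps can be
    // aggregated into one concrete result object.
    ContentFileResult appendAll(RewriteResult<String> other) {
      this.toRewrite.addAll(other.toRewrite);
      return this;
    }
  }

  // Aggregates a data-file result and a delete-file result into one.
  static int demo() {
    ContentFileResult dataFiles = new ContentFileResult();
    dataFiles.toRewrite.add("data-file");
    ContentFileResult deleteFiles = new ContentFileResult();
    deleteFiles.toRewrite.add("delete-file");
    return dataFiles.appendAll(deleteFiles).toRewrite.size();
  }

  public static void main(String[] args) {
    System.out.println(demo()); // 2
  }
}
```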

@@ -43,4 +43,15 @@ public static Schema pathPosSchema() {
public static Schema posDeleteSchema(Schema rowSchema) {
return rowSchema == null ? pathPosSchema() : pathPosSchema(rowSchema);
}

public static Schema posDeleteReadSchema(Schema rowSchema) {
Collaborator Author

@szehon-ho Jan 5, 2025

Somehow after the rebase this is needed for the position delete rewrite (there must be some intervening change related to delete readers). Previously this used the method above, pathPosSchema(rowSchema), for the read schema, which has 'row' as required. This would fail, saying 'row' is required but not found in the delete file, as 'row' is usually not set.

Note that Spark and all readers actually don't include the 'row' field in the read schema: https://github.com/apache/iceberg/blob/main/data/src/main/java/org/apache/iceberg/data/BaseDeleteLoader.java#L70.

But here, I do want to read the 'row' field and preserve it if it is set by some engine. So I am taking the strategy of RewritePositionDelete and actually reading this field, but as optional, to avoid the assert error if it is not found: https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/PositionDeletesTable.java#L118 (the reader there is derived from the schema of the metadata table PositionDeletesTable).
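
A minimal sketch of the idea (a hypothetical field model, not the Iceberg Schema API): reading a required field that is absent from the file fails, while an optional field simply reads as null, so the read schema marks 'row' optional.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PosDeleteReadSchemaSketch {
  // Maps each field name to whether it is required in the read schema.
  static Map<String, Boolean> posDeleteReadSchema(boolean includeRow) {
    Map<String, Boolean> fields = new LinkedHashMap<>();
    fields.put("file_path", true);
    fields.put("pos", true);
    if (includeRow) {
      // 'row' is read as optional so delete files that never wrote it still
      // load, while engines that did set it have the value preserved.
      fields.put("row", false);
    }
    return fields;
  }

  public static void main(String[] args) {
    Map<String, Boolean> schema = posDeleteReadSchema(true);
    System.out.println(schema.get("row")); // false => optional
  }
}
```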

Contributor

This might not be directly related to this PR, but it seems like the column 'row' should NOT be marked as required anywhere.

@szehon-ho szehon-ho closed this Jan 6, 2025
@szehon-ho szehon-ho reopened this Jan 6, 2025
Contributor

@flyrain left a comment

Thanks @szehon-ho ! LGTM.



Contributor

@dramaticlly left a comment

Thanks Szehon for all the changes and for putting together the RewriteResult interface and RewriteContentFileResult!

@szehon-ho
Collaborator Author

szehon-ho commented Jan 7, 2025

@flyrain addressed comments, and also added a unit test for the case where the 'row' column is set on a position delete file, ensuring that the value is carried over if set.

@flyrain
Contributor

flyrain commented Jan 7, 2025

Thanks @szehon-ho for working on this. Feel free to merge it.

@szehon-ho szehon-ho merged commit 39a4cfd into apache:main Jan 8, 2025
50 checks passed
@szehon-ho
Collaborator Author

Thanks a lot @flyrain and @dramaticlly for the review; we can continue improving this in follow-up PRs.
