Spark: Bypass Spark's ViewCatalog API when replacing a view #9596
Conversation
 * @throws NoSuchViewException If the view doesn't exist or is a table
 * @throws NoSuchNamespaceException If the identifier namespace does not exist (optional)
 */
View replaceView(
I don't think we need to have `createOrReplace()` and `replace()` here. Spark's API for these semantics looks a bit different and is defined here, so I think just having `replaceView()` should be ok.
Also, that reminds me to change the return type of Spark's `replaceView(ViewInfo viewInfo, boolean orCreate)` from `void` to `View`.
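For reference, a rough, non-authoritative sketch of the interface shape being discussed; `View`, `ViewInfo`, and `NoSuchViewException` below are stand-in stubs so the sketch compiles on its own, not the actual Spark or Iceberg types:

```java
// Stand-in stubs for illustration only; the real types live in Spark/Iceberg.
interface View {}

final class ViewInfo {}

class NoSuchViewException extends Exception {}

// Sketch of the single replaceView() entry point suggested above, returning
// the new View rather than void.
interface SupportsReplacingViews {
  /**
   * Replace an existing view, or create it when orCreate is true.
   *
   * @throws NoSuchViewException if the view doesn't exist and orCreate is false
   */
  View replaceView(ViewInfo viewInfo, boolean orCreate) throws NoSuchViewException;
}
```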
Force-pushed from 87e3247 to 45732fe
.withDefaultCatalog(currentCatalog)
.withDefaultNamespace(Namespace.of(currentNamespace))
.withQuery("spark", sql)
.withSchema(icebergSchema)
This works, but we may want to handle field IDs differently in the future. Because there are no data files, it doesn't really matter how we assign IDs, but it is nice to have consistent IDs across schema versions because it could be confusing otherwise.
The table implementation keeps schemas consistent for replace table operations by passing the previous schema in to assign the new IDs, like this:
Schema freshSchema =
TypeUtil.assignFreshIds(updatedSchema, schema(), newLastColumnId::incrementAndGet);
This would be a separate PR if we wanted to handle it in `ViewMetadata`.
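For illustration, a minimal self-contained example of how `TypeUtil.assignFreshIds` reuses IDs from a base schema; the schemas and field names here are made up:

```java
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.TypeUtil;
import org.apache.iceberg.types.Types;

public class AssignFreshIdsExample {
  public static void main(String[] args) {
    // Previous view schema: id -> 1, data -> 2
    Schema current =
        new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional(2, "data", Types.StringType.get()));

    // Schema derived from the new view SQL; its IDs are throwaway values
    Schema updated =
        new Schema(
            Types.NestedField.required(100, "id", Types.LongType.get()),
            Types.NestedField.optional(101, "data", Types.StringType.get()),
            Types.NestedField.optional(102, "extra", Types.DoubleType.get()));

    // Fields that also exist in the base schema keep their old IDs; only
    // genuinely new fields get fresh IDs after the previous highest ID
    AtomicInteger lastColumnId = new AtomicInteger(current.highestFieldId());
    Schema fresh =
        TypeUtil.assignFreshIds(updated, current, lastColumnId::incrementAndGet);

    System.out.println(fresh); // id keeps 1, data keeps 2, extra gets 3
  }
}
```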
I see what you mean, and this makes sense. I'll take a closer look at this outside of this PR.
I've opened #10253 to address this.
} catch (org.apache.iceberg.exceptions.NoSuchNamespaceException e) {
  throw new NoSuchNamespaceException(currentNamespace);
} catch (org.apache.iceberg.exceptions.NoSuchViewException e) {
  throw new NoSuchViewException(ident);
If this is thrown, then I think the Spark exec node should catch it and create the view. The `CREATE OR REPLACE` operation is supposed to be idempotent, so it should not fail if the view is concurrently dropped.
This probably isn't a big deal, but it would be nice to handle it.
This only exists due to the limited default implementation (which does a drop + create) in Spark's `replace()`, as can be seen here. That being said, I don't think we need to handle `NoSuchViewException` in our implementation, since ours is idempotent. I went ahead and removed catching this here.
Why don't we need to handle the exception here? Translating to the right Spark exception seems like something we should definitely do.
I realize that we already check whether the view exists, but if the view is dropped concurrently, it could still fail. It's an edge case, but the right thing to do is to catch and issue the create, right? `CREATE OR REPLACE` should never fail because the view doesn't exist.
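A minimal sketch of the catch-and-create fallback being described, using stand-in types rather than the real Spark exec node or catalog API:

```java
// Stand-ins for illustration; the real implementation would live in Spark's
// CREATE OR REPLACE exec node and use the actual catalog API.
interface ViewOps {
  void replaceView(String name) throws NoSuchViewException;

  void createView(String name);
}

class NoSuchViewException extends Exception {
  NoSuchViewException(String name) {
    super("View not found: " + name);
  }
}

class CreateOrReplaceViewExec {
  static void createOrReplace(ViewOps ops, String name) {
    try {
      ops.replaceView(name);
    } catch (NoSuchViewException e) {
      // The view was dropped concurrently between the existence check and
      // the replace; CREATE OR REPLACE must still succeed, so create it
      ops.createView(name);
    }
  }
}
```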
This isn't a blocker, so I'll merge this PR. We can follow up on this later.
I'll follow up on this.
Looks great! There are some minor things, but nothing that is a big blocker.
Spark's `ViewCatalog` API doesn't have a `replace()` in 3.5, as it was only introduced later. Therefore we're bypassing Spark's `ViewCatalog` so that we can keep the view's history after executing a `CREATE OR REPLACE`.
Force-pushed from 45732fe to dd20550
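To make the commit message concrete, here is a hedged sketch (the identifier, namespace, and catalog name are made up) of going through Iceberg's `ViewBuilder` directly, where `createOrReplace()` records a new view version instead of dropping and recreating the view:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.catalog.ViewCatalog;
import org.apache.iceberg.view.View;

class ReplaceViewDirectly {
  static View createOrReplace(ViewCatalog catalog, String sql, Schema schema) {
    return catalog
        .buildView(TableIdentifier.of("db", "event_summary"))
        .withDefaultCatalog("spark_catalog")
        .withDefaultNamespace(Namespace.of("db"))
        .withQuery("spark", sql)
        .withSchema(schema)
        // records a new version in the view metadata, preserving history,
        // unlike Spark's default drop + create fallback
        .createOrReplace();
  }
}
```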