API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs #9414

amogh-jahagirdar · 2024-01-05T05:37:32Z

When working on Trino Iceberg view support, I realized that updating column comments across engines is the same logic 1.) Get the view schema
2.) Find the field to update, update it
3.) Return a new Schema (preserving all the other attributes like identifier fields, aliases)
4.) Replace the schema on the view

This PR puts this logic a separate API just so engines don't have to duplicate this logic.

Two public APIs are added:

Schema#withUpdatedDoc which creates a new schema from the instance preserving the existing fields (types/ids/etc) but replacing a field's column document.

View#updateColumnDoc which will return a ReplaceViewVersion which when successfully commited will create a new version with the replaced column doc and preserve everything else in the current version

I don't think this matters for Spark since the way to update column comments is doing a REPLACE VIEW
which will need to visit the schema separately for any changes, but Trino + other engines support operations which just update column comments, and I think having an Iceberg API which just does that would be convenient and avoid any bugs in failing to preserve the existing version's contents.

amogh-jahagirdar · 2024-01-05T05:41:32Z

core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java

+    view.replaceVersion()
+        .withDefaultCatalog(view.currentVersion().defaultCatalog())
+        .withDefaultNamespace(identifier.namespace())
+        .withSchema(view.schema().withColumnDoc("data", "some docs for data"))
+        .withQuery("spark", view.sqlFor("spark").sql())
+        .commit();


We can actually take this a step further by adding an API to view , view.updateColumnDoc which returns ReplaceViewVersion which is already initialized with namespace/catalog/query details since those will all be the same for someone who just wants to update the doc.

Actually I think that's more than just a nice to have, since we need to make sure all dialects are preserved in the replace. It's importat to have that be in the API implementation itself rather than have different engine behaviors.

amogh-jahagirdar · 2024-01-05T06:32:14Z

core/src/main/java/org/apache/iceberg/view/BaseView.java

+
+  @Override
+  @SuppressWarnings("checkstyle:HiddenField")
+  public ReplaceViewVersion updateColumnDoc(String name, String doc) {


Alternatively we could implement UpdateSchema for Views and the only implementation is updateColumndoc, and everything else throws Unsupported. I was -1 that since even returning UpdateSchema to callers will make it seem like some sort of schema evolution on views is possible, but it's not (and shouldn't be)

Ah the awkward thing is method chaining...

view.updateColumDoc("col1", "doc1").updateColumnDoc("col2, "doc2").commit()

This will only update col2 since we're not doing any change tracking or re-using any previous replaces.

Maybe this API instead can accept multiple col/docs, but probably if we go down this route it's best to just follow the existing pattern and do some change tracking

amogh-jahagirdar · 2024-01-05T07:59:39Z

Hm....after thinking about this a bit more, I think we can remove the Schema API and add an API to ReplaceViewVersion withUpdatedColumnDoc. This API can do change tracking and then do all the schema changing logic internally. The Schema API isn't really needed to be public since this logic really is only needed for views.

ReplaceViewVersion can be updated to not require a query/default namespace if the operation being performed is just updating the column doc.

Then the API will be

view.replaceViewVersion().withColumnDoc("col1", "doc1").withColumnDoc("col2", "doc2").commit()

rdblue · 2024-01-05T17:06:21Z

@amogh-jahagirdar, I agree with not adding a with method to modify a schema. If we can avoid making that kind of API addition I would prefer it.

I'm also not sure that we want to modify existing views by updating the schema. That's a lot of validation and complexity. The schema should be produced by fairly standard SQL syntax:

CREATE OR REPLACE VIEW default.event_agg (
    event_count COMMENT 'Count of events',
    event_date)
COMMENT 'Daily event counts'
AS
SELECT
    COUNT(1), CAST(event_ts AS DATE)
FROM prod.default.events
GROUP BY 2

If we allow making schema changes, then would we also allow modifying the view by widening view columns? That line of reasoning makes me think that any change (even comments) should just produce a new version of the view.

rdblue · 2024-01-05T17:24:01Z

Oops, I didn't intend to close this!

amogh-jahagirdar · 2024-01-05T20:51:43Z

That line of reasoning makes me think that any change (even comments) should just produce a new version of the view.

Yes, that's the approach in this PR currently. A new version is produced via ReplaceViewVersion (view.updateColumnDoc()) returns ReplaceViewVersion. In my next iteration, I'm going to just put the logic in the existing Replace API. That fits the operation pattern that exists throughout Iceberg better.

I think creating a new version is required, since we are producing a new schema and there must be a new version to point to the new schema. The issue is that currently the only API that exists for producing a new version is replace which requires users to pass in query/dialect pair(s), default namespace, schema, etc.

The case for a new API is so that engine integrations don't need to determine all the queries/dialects/default namespace/catalog when the replace is performed for just column doc updates.

More concretely, take the Trino case, which allows the following syntax:https://trino.io/docs/current/sql/comment.html

COMMENT ON view.col IS 'some col'

Currently, just using the replace API we'd have to implement creating the new Schema with the updated doc (the code in Schema#withUpdatedColumnDoc) and then we'd also want to preserve all the queries/dialects, default namespace, and default catalog so the integration has to pass the current state of all of those to the replace API. Here's what an implementation without any new APIs would look like

public void updateColumn(String col, String doc) {
      View view = loadView(...)

      Schema updatedSchema = newSchemaWithColumnDoc(view.schema(), col, doc);
      view.replaceVersion()
            .withSchema(updatedSchema)
            .withQuery(allCurrentDialects, allTheirQueries)
           .withDefaultCatalog(view.currentVersion().defaultCatalog())
           .withNamespace(view.currentVersion().namespace())
            .commit()
}

private Schema newSchemaWithColumnDoc(Schema schema, String col, String doc) {
   NestedField fieldToUpdate = findField(name);
    Preconditions.checkArgument(fieldToUpdate != null, "Field %s does not exist", name);
    NestedField updatedField =
        NestedField.of(
            fieldToUpdate.fieldId(),
            fieldToUpdate.isOptional(),
            fieldToUpdate.name(),
            fieldToUpdate.type(),
            doc);

    List<NestedField> newFields = Lists.newArrayList();
    newFields.add(updatedField);

    for (NestedField field : columns()) {
      if (field.fieldId() != updatedField.fieldId()) {
        newFields.add(field);
      }
    }

    return new Schema(newFields, getAliases(), identifierFieldIds());

}

So the integration has to pass in a bunch of mostly irrelevant details into the replace just for updating the docs and if an engine integration misses a dialect to include in the replace, then updating the column doc has removed the other dialect unintentionally for the new version. Having an API in the library which takes care of these details would avoid any of these weird issues.

With the new API the integration should look something like:


public void updateColumnDoc(String col, String doc) {
     View view = loadView(...)
      view.replaceVersion()
           .withColumnDoc(col, doc)
            .commit()
}

I think in the Spark world this doesn't really matter because it's a full replace anyways, so a new schema, new default namespace, etc all need to be determined (in the case that the schema is semantically the same, and the view text is the same though, Spark would probably want to use this API). But for engines which allow just column level doc updates, it seems error prone not to have an API for that.

If we allow making schema changes, then would we also allow modifying the view by widening view columns?

I would advocate against this. These kinds of schema evolution capabilities on views don't seem that useful to me at the cost of a lot of complexity. I think there's a semantic distinction in those kinds of capabilities and adding docs. That's why I don't think it makes sense to have UpdateViewSchema implementation for views and discarded it for the current approach.

amogh-jahagirdar · 2024-01-12T19:42:16Z

core/src/main/java/org/apache/iceberg/view/ViewVersionReplace.java

+      Preconditions.checkState(
+          representations.isEmpty(), "Cannot change column docs and replace representations");
+      Preconditions.checkState(
+          null == schema, "Cannot change column docs and also explicitly change schema");
+      Preconditions.checkState(
+          null == defaultNamespace, "Cannot change column docs and change default namespace");


I need to add some tests which exercise these Preconditions checks

github-actions · 2024-10-13T00:16:06Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions · 2024-10-20T00:16:44Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot added API core labels Jan 5, 2024

amogh-jahagirdar commented Jan 5, 2024

View reviewed changes

amogh-jahagirdar force-pushed the schema-with-docs-api branch 2 times, most recently from 4714ae9 to f4e9965 Compare January 5, 2024 06:03

amogh-jahagirdar changed the title ~~API, Core: Add withUpdatedColumnDoc API to Schema~~ API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs Jan 5, 2024

amogh-jahagirdar force-pushed the schema-with-docs-api branch 2 times, most recently from 05c71aa to cd7bb8d Compare January 5, 2024 06:22

amogh-jahagirdar commented Jan 5, 2024

View reviewed changes

rdblue closed this Jan 5, 2024

rdblue reopened this Jan 5, 2024

API, Core: Add withColumnDoc API to ReplaceViewVersion

f30f27f

amogh-jahagirdar force-pushed the schema-with-docs-api branch from cd7bb8d to f30f27f Compare January 12, 2024 19:18

amogh-jahagirdar commented Jan 12, 2024

View reviewed changes

github-actions bot added the stale label Oct 13, 2024

github-actions bot closed this Oct 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs #9414

API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs #9414

amogh-jahagirdar commented Jan 5, 2024 •

edited

Loading

amogh-jahagirdar Jan 5, 2024

amogh-jahagirdar Jan 5, 2024 •

edited

Loading

amogh-jahagirdar Jan 5, 2024 •

edited

Loading

amogh-jahagirdar Jan 5, 2024 •

edited

Loading

amogh-jahagirdar commented Jan 5, 2024 •

edited

Loading

rdblue commented Jan 5, 2024 •

edited

Loading

rdblue commented Jan 5, 2024

amogh-jahagirdar commented Jan 5, 2024 •

edited

Loading

amogh-jahagirdar Jan 12, 2024

github-actions bot commented Oct 13, 2024

github-actions bot commented Oct 20, 2024

API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs #9414

API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs #9414

Conversation

amogh-jahagirdar commented Jan 5, 2024 • edited Loading

amogh-jahagirdar Jan 5, 2024

Choose a reason for hiding this comment

amogh-jahagirdar Jan 5, 2024 • edited Loading

Choose a reason for hiding this comment

amogh-jahagirdar Jan 5, 2024 • edited Loading

Choose a reason for hiding this comment

amogh-jahagirdar Jan 5, 2024 • edited Loading

Choose a reason for hiding this comment

amogh-jahagirdar commented Jan 5, 2024 • edited Loading

rdblue commented Jan 5, 2024 • edited Loading

rdblue commented Jan 5, 2024

amogh-jahagirdar commented Jan 5, 2024 • edited Loading

amogh-jahagirdar Jan 12, 2024

Choose a reason for hiding this comment

github-actions bot commented Oct 13, 2024

github-actions bot commented Oct 20, 2024

amogh-jahagirdar commented Jan 5, 2024 •

edited

Loading

amogh-jahagirdar Jan 5, 2024 •

edited

Loading

amogh-jahagirdar Jan 5, 2024 •

edited

Loading

amogh-jahagirdar Jan 5, 2024 •

edited

Loading

amogh-jahagirdar commented Jan 5, 2024 •

edited

Loading

rdblue commented Jan 5, 2024 •

edited

Loading

amogh-jahagirdar commented Jan 5, 2024 •

edited

Loading