Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs #9414

Closed

Conversation

amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Jan 5, 2024

When working on Trino Iceberg view support, I realized that updating column comments across engines is the same logic 1.) Get the view schema
2.) Find the field to update, update it
3.) Return a new Schema (preserving all the other attributes like identifier fields, aliases)
4.) Replace the schema on the view

This PR puts this logic a separate API just so engines don't have to duplicate this logic.

Two public APIs are added:

Schema#withUpdatedDoc which creates a new schema from the instance preserving the existing fields (types/ids/etc) but replacing a field's column document.

View#updateColumnDoc which will return a ReplaceViewVersion which when successfully commited will create a new version with the replaced column doc and preserve everything else in the current version

I don't think this matters for Spark since the way to update column comments is doing a REPLACE VIEW
which will need to visit the schema separately for any changes, but Trino + other engines support operations which just update column comments, and I think having an Iceberg API which just does that would be convenient and avoid any bugs in failing to preserve the existing version's contents.

Comment on lines 1760 to 1765
view.replaceVersion()
.withDefaultCatalog(view.currentVersion().defaultCatalog())
.withDefaultNamespace(identifier.namespace())
.withSchema(view.schema().withColumnDoc("data", "some docs for data"))
.withQuery("spark", view.sqlFor("spark").sql())
.commit();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can actually take this a step further by adding an API to view , view.updateColumnDoc which returns ReplaceViewVersion which is already initialized with namespace/catalog/query details since those will all be the same for someone who just wants to update the doc.

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think that's more than just a nice to have, since we need to make sure all dialects are preserved in the replace. It's importat to have that be in the API implementation itself rather than have different engine behaviors.

@amogh-jahagirdar amogh-jahagirdar force-pushed the schema-with-docs-api branch 2 times, most recently from 4714ae9 to f4e9965 Compare January 5, 2024 06:03
@amogh-jahagirdar amogh-jahagirdar changed the title API, Core: Add withUpdatedColumnDoc API to Schema API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs Jan 5, 2024
@amogh-jahagirdar amogh-jahagirdar force-pushed the schema-with-docs-api branch 2 times, most recently from 05c71aa to cd7bb8d Compare January 5, 2024 06:22

@Override
@SuppressWarnings("checkstyle:HiddenField")
public ReplaceViewVersion updateColumnDoc(String name, String doc) {
Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could implement UpdateSchema for Views and the only implementation is updateColumndoc, and everything else throws Unsupported. I was -1 that since even returning UpdateSchema to callers will make it seem like some sort of schema evolution on views is possible, but it's not (and shouldn't be)

Copy link
Contributor Author

@amogh-jahagirdar amogh-jahagirdar Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah the awkward thing is method chaining...

view.updateColumDoc("col1", "doc1").updateColumnDoc("col2, "doc2").commit()

This will only update col2 since we're not doing any change tracking or re-using any previous replaces.

Maybe this API instead can accept multiple col/docs, but probably if we go down this route it's best to just follow the existing pattern and do some change tracking

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jan 5, 2024

Hm....after thinking about this a bit more, I think we can remove the Schema API and add an API to ReplaceViewVersion withUpdatedColumnDoc. This API can do change tracking and then do all the schema changing logic internally. The Schema API isn't really needed to be public since this logic really is only needed for views.

ReplaceViewVersion can be updated to not require a query/default namespace if the operation being performed is just updating the column doc.

Then the API will be

view.replaceViewVersion().withColumnDoc("col1", "doc1").withColumnDoc("col2", "doc2").commit()

@rdblue
Copy link
Contributor

rdblue commented Jan 5, 2024

@amogh-jahagirdar, I agree with not adding a with method to modify a schema. If we can avoid making that kind of API addition I would prefer it.

I'm also not sure that we want to modify existing views by updating the schema. That's a lot of validation and complexity. The schema should be produced by fairly standard SQL syntax:

CREATE OR REPLACE VIEW default.event_agg (
    event_count COMMENT 'Count of events',
    event_date)
COMMENT 'Daily event counts'
AS
SELECT
    COUNT(1), CAST(event_ts AS DATE)
FROM prod.default.events
GROUP BY 2

If we allow making schema changes, then would we also allow modifying the view by widening view columns? That line of reasoning makes me think that any change (even comments) should just produce a new version of the view.

@rdblue rdblue closed this Jan 5, 2024
@rdblue rdblue reopened this Jan 5, 2024
@rdblue
Copy link
Contributor

rdblue commented Jan 5, 2024

Oops, I didn't intend to close this!

@amogh-jahagirdar
Copy link
Contributor Author

amogh-jahagirdar commented Jan 5, 2024

That line of reasoning makes me think that any change (even comments) should just produce a new version of the view.

Yes, that's the approach in this PR currently. A new version is produced via ReplaceViewVersion (view.updateColumnDoc()) returns ReplaceViewVersion. In my next iteration, I'm going to just put the logic in the existing Replace API. That fits the operation pattern that exists throughout Iceberg better.

I think creating a new version is required, since we are producing a new schema and there must be a new version to point to the new schema. The issue is that currently the only API that exists for producing a new version is replace which requires users to pass in query/dialect pair(s), default namespace, schema, etc.

The case for a new API is so that engine integrations don't need to determine all the queries/dialects/default namespace/catalog when the replace is performed for just column doc updates.

More concretely, take the Trino case, which allows the following syntax:https://trino.io/docs/current/sql/comment.html

COMMENT ON view.col IS 'some col'

Currently, just using the replace API we'd have to implement creating the new Schema with the updated doc (the code in Schema#withUpdatedColumnDoc) and then we'd also want to preserve all the queries/dialects, default namespace, and default catalog so the integration has to pass the current state of all of those to the replace API. Here's what an implementation without any new APIs would look like

public void updateColumn(String col, String doc) {
      View view = loadView(...)

      Schema updatedSchema = newSchemaWithColumnDoc(view.schema(), col, doc);
      view.replaceVersion()
            .withSchema(updatedSchema)
            .withQuery(allCurrentDialects, allTheirQueries)
           .withDefaultCatalog(view.currentVersion().defaultCatalog())
           .withNamespace(view.currentVersion().namespace())
            .commit()
}

private Schema newSchemaWithColumnDoc(Schema schema, String col, String doc) {
   NestedField fieldToUpdate = findField(name);
    Preconditions.checkArgument(fieldToUpdate != null, "Field %s does not exist", name);
    NestedField updatedField =
        NestedField.of(
            fieldToUpdate.fieldId(),
            fieldToUpdate.isOptional(),
            fieldToUpdate.name(),
            fieldToUpdate.type(),
            doc);

    List<NestedField> newFields = Lists.newArrayList();
    newFields.add(updatedField);

    for (NestedField field : columns()) {
      if (field.fieldId() != updatedField.fieldId()) {
        newFields.add(field);
      }
    }

    return new Schema(newFields, getAliases(), identifierFieldIds());

}

So the integration has to pass in a bunch of mostly irrelevant details into the replace just for updating the docs and if an engine integration misses a dialect to include in the replace, then updating the column doc has removed the other dialect unintentionally for the new version. Having an API in the library which takes care of these details would avoid any of these weird issues.

With the new API the integration should look something like:


public void updateColumnDoc(String col, String doc) {
     View view = loadView(...)
      view.replaceVersion()
           .withColumnDoc(col, doc)
            .commit()
}
 

I think in the Spark world this doesn't really matter because it's a full replace anyways, so a new schema, new default namespace, etc all need to be determined (in the case that the schema is semantically the same, and the view text is the same though, Spark would probably want to use this API). But for engines which allow just column level doc updates, it seems error prone not to have an API for that.

If we allow making schema changes, then would we also allow modifying the view by widening view columns?

I would advocate against this. These kinds of schema evolution capabilities on views don't seem that useful to me at the cost of a lot of complexity. I think there's a semantic distinction in those kinds of capabilities and adding docs. That's why I don't think it makes sense to have UpdateViewSchema implementation for views and discarded it for the current approach.

Comment on lines +75 to +80
Preconditions.checkState(
representations.isEmpty(), "Cannot change column docs and replace representations");
Preconditions.checkState(
null == schema, "Cannot change column docs and also explicitly change schema");
Preconditions.checkState(
null == defaultNamespace, "Cannot change column docs and change default namespace");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to add some tests which exercise these Preconditions checks

Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 13, 2024
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Oct 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants