API, Core: Add Schema#withUpdatedDoc and View#updateColumnDoc APIs #9414
Conversation
view.replaceVersion()
    .withDefaultCatalog(view.currentVersion().defaultCatalog())
    .withDefaultNamespace(identifier.namespace())
    .withSchema(view.schema().withColumnDoc("data", "some docs for data"))
    .withQuery("spark", view.sqlFor("spark").sql())
    .commit();
We can actually take this a step further by adding an API to View, view.updateColumnDoc, which returns a ReplaceViewVersion that is already initialized with the namespace/catalog/query details, since those will all be the same for someone who just wants to update the doc.
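A minimal sketch of how that could look from the engine side (the column name and doc text here are just placeholders):

// Hypothetical: View#updateColumnDoc returns a ReplaceViewVersion that is
// assumed to be pre-populated with the current catalog, namespace, and queries.
view.updateColumnDoc("data", "some docs for data").commit();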
Actually, I think that's more than just a nice-to-have, since we need to make sure all dialects are preserved in the replace. It's important to have that be in the API implementation itself rather than have different engine behaviors.
@Override
@SuppressWarnings("checkstyle:HiddenField")
public ReplaceViewVersion updateColumnDoc(String name, String doc) {
Alternatively, we could implement UpdateSchema for Views, where the only supported operation is updateColumnDoc and everything else throws Unsupported. I was -1 on that, since even returning UpdateSchema to callers would make it seem like some sort of schema evolution on views is possible, but it's not (and shouldn't be).
Ah, the awkward thing is method chaining...
view.updateColumnDoc("col1", "doc1").updateColumnDoc("col2", "doc2").commit()
This will only update col2, since we're not doing any change tracking or re-using any previous replaces.
Maybe this API could instead accept multiple col/doc pairs, but if we go down this route it's probably best to just follow the existing pattern and do some change tracking.
Hm... after thinking about this a bit more, I think we can remove the Schema API and add an API to ReplaceViewVersion. ReplaceViewVersion can be updated to not require a query/default namespace when the operation being performed is just updating the column doc. Then the API would look something like the sketch below.
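A minimal sketch of that direction, assuming updateColumnDoc is added to ReplaceViewVersion and the catalog/namespace/query details are carried over from the current version when only docs change (column names and doc text are placeholders):

// Hypothetical chaining on ReplaceViewVersion, with doc changes tracked inside the builder.
view.replaceVersion()
    .updateColumnDoc("col1", "doc1")
    .updateColumnDoc("col2", "doc2")
    .commit();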
@amogh-jahagirdar, I agree with not adding an UpdateSchema implementation for views.
I'm also not sure that we want to modify existing views by updating the schema. That's a lot of validation and complexity. The schema should be produced by fairly standard SQL syntax:
CREATE OR REPLACE VIEW default.event_agg (
    event_count COMMENT 'Count of events',
    event_date)
COMMENT 'Daily event counts'
AS
SELECT
    COUNT(1), CAST(event_ts AS DATE)
FROM prod.default.events
GROUP BY 2
If we allow making schema changes, then would we also allow modifying the view by widening view columns? That line of reasoning makes me think that any change (even comments) should just produce a new version of the view.
Oops, I didn't intend to close this!
Yes, that's the approach in this PR currently: a new version is produced via ReplaceViewVersion (view.updateColumnDoc() returns a ReplaceViewVersion). In my next iteration, I'm going to just put the logic in the existing replace API; that fits the operation pattern that exists throughout Iceberg better.
I think creating a new version is required, since we are producing a new schema and there must be a new version to point to the new schema. The issue is that currently the only API for producing a new version is the full view.replaceVersion().
The case for a new API is so that engine integrations don't need to determine all the queries/dialects/default namespace/catalog when the replace is performed just to update column docs. More concretely, take the Trino case, which allows the following syntax: https://trino.io/docs/current/sql/comment.html
Currently, just using the replace API, we'd have to implement creating the new Schema with the updated doc (the code in Schema#withUpdatedColumnDoc), and we'd also want to preserve all the queries/dialects, default namespace, and default catalog, so the integration has to pass the current state of all of those to the replace API. Here's what an implementation without any new APIs would look like:
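A sketch of that engine-side code, mirroring the snippet discussed earlier in this review; the column name, doc text, and dialect are illustrative, and Schema#withUpdatedColumnDoc follows the naming used in the comment above:

// Without a dedicated API, the engine rebuilds the schema itself and must
// re-supply the default catalog, default namespace, and every SQL dialect;
// missing a dialect here would silently drop it from the new version.
view.replaceVersion()
    .withSchema(view.schema().withUpdatedColumnDoc("event_count", "Count of events"))
    .withDefaultCatalog(view.currentVersion().defaultCatalog())
    .withDefaultNamespace(view.currentVersion().defaultNamespace())
    .withQuery("trino", view.sqlFor("trino").sql())
    .commit();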
So the integration has to pass a bunch of mostly irrelevant details into the replace just to update the docs, and if an engine integration misses a dialect in the replace, then updating the column doc has unintentionally removed that dialect from the new version. Having an API in the library which takes care of these details would avoid these issues. With the new API, the integration should look something like:
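A sketch of what that could look like, assuming View#updateColumnDoc pre-populates the replace with the current version's details (column name and doc text are placeholders):

// Everything except the doc itself is carried over from the current version.
view.updateColumnDoc("event_count", "Count of events").commit();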
I think in the Spark world this doesn't really matter, because it's a full replace anyway, so a new schema, new default namespace, etc. all need to be determined (in the case that the schema is semantically the same and the view text is the same, though, Spark would probably want to use this API). But for engines which allow column-level doc updates on their own, it seems error prone not to have an API for that.
I would advocate against this. These kinds of schema evolution capabilities on views don't seem that useful to me given the cost of a lot of complexity. I think there's a semantic distinction between those kinds of capabilities and adding docs. That's why I don't think it makes sense to have an UpdateViewSchema implementation for views, and why I discarded it for the current approach.
Preconditions.checkState(
    representations.isEmpty(), "Cannot change column docs and replace representations");
Preconditions.checkState(
    null == schema, "Cannot change column docs and also explicitly change schema");
Preconditions.checkState(
    null == defaultNamespace, "Cannot change column docs and change default namespace");
I need to add some tests which exercise these Preconditions checks, along the lines of the sketch below.
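A rough idea of such a test, assuming the checks surface as IllegalStateException (which Preconditions.checkState throws) when a doc update is combined with other changes, and using the AssertJ style common in Iceberg tests:

// Sketch only: combining a column-doc update with a new representation should
// trip the "Cannot change column docs and replace representations" check.
assertThatThrownBy(
        () ->
            view.updateColumnDoc("data", "new docs")
                .withQuery("spark", "SELECT * FROM base_tbl")
                .commit())
    .isInstanceOf(IllegalStateException.class)
    .hasMessageContaining("Cannot change column docs and replace representations");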
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
When working on Trino Iceberg view support, I realized that updating column comments is the same logic across engines:
1.) Get the view schema
2.) Find the field to update and update its doc
3.) Return a new Schema, preserving all the other attributes (identifier fields, aliases, etc.)
4.) Replace the schema on the view
This PR puts this logic in a separate API so that engines don't have to duplicate it.
Two public APIs are added (a rough usage sketch follows below):
Schema#withUpdatedDoc, which creates a new schema from the instance, preserving the existing fields (types/ids/etc.) but replacing a field's column doc.
View#updateColumnDoc, which returns a ReplaceViewVersion that, when successfully committed, creates a new version with the replaced column doc and preserves everything else from the current version.
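A minimal usage sketch of the two APIs, assuming withUpdatedDoc takes a column name and the new doc (the names here are illustrative):

// Schema-level: build a new Schema that differs only in one field's doc.
Schema updated = view.schema().withUpdatedDoc("event_count", "Count of events");

// View-level: commit a new view version with only the doc changed; queries,
// dialects, default catalog, and default namespace are preserved.
view.updateColumnDoc("event_count", "Count of events").commit();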
I don't think this matters much for Spark, since the way to update column comments there is a REPLACE VIEW, which has to visit the schema separately for any changes. But Trino and other engines support operations that update only column comments, and I think having an Iceberg API that does just that would be convenient and would avoid bugs from failing to preserve the existing version's contents.