Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark: Throw exception on ALTER VIEW <viewName> AS <query> #9510

Merged
merged 1 commit into from
Feb 1, 2024

Conversation

nastra
Copy link
Contributor

@nastra nastra commented Jan 18, 2024

ALTER VIEW <viewName> AS <query> doesn't allow to preserve column aliases and column comment when the underlying query is changed, which can lead to unexpected behavior. For now it's better to use CREATE OR REPLACE VIEW as that is more explicit when the schema of the view is defined with column aliases/comments.

PS: Tests will pass once the RewriteViewCommands changes from #9582 are merged

@github-actions github-actions bot added the spark label Jan 18, 2024
@nastra nastra added this to the Iceberg 1.5.0 milestone Jan 18, 2024
@nastra nastra force-pushed the spark-view-alter-support branch from 6394bb5 to fb714a6 Compare January 18, 2024 12:07
@nastra nastra force-pushed the spark-view-alter-support branch 4 times, most recently from ebfb691 to 477bd67 Compare January 25, 2024 11:32
@nastra nastra force-pushed the spark-view-alter-support branch 2 times, most recently from 823df91 to 907a5f1 Compare January 26, 2024 17:11
@@ -36,6 +38,9 @@ object CheckViews extends (LogicalPlan => Unit) {
verifyColumnCount(ident, columnAliases, query)
SchemaUtils.checkColumnNameDuplication(query.schema.fieldNames, SQLConf.get.resolver)

case AlterViewAs(ResolvedV2View(_, _, _), _, query) =>
SchemaUtils.checkColumnNameDuplication(query.schema.fieldNames, SQLConf.get.resolver)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have to verify column counts here because ALTER doesn't support specifying column aliases

Copy link
Contributor

@rdblue rdblue Jan 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worries me. Are the aliases from the original create expected to be applied to the new view version? I don't think that the expected behavior is for the column comments and aliases to go away.

For example, if I ran these two statements, what should happen?

-- original create command
CREATE VIEW daily_signups (
    signup_count COMMENT 'Number of signups',
    day COMMENT 'Signup date') AS
SELECT count(1), day FROM signups GROUP BY day

-- updated view with DISTINCT to deduplicate
ALTER VIEW daily_signups AS
SELECT count(distinct email), day FROM signups GROUP BY day

To me, it seems like this should work and result in the same view schema, with signup_count as the first column and the column docs preserved. That probably means that we need to store the column names and comments separately after all, or at least flag that they were set somewhere (view properties?).

Also, it looks like the indentation is inconsistent between this and other cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is a weird case and I also had a different expectation when initially running into this. That's why I added the alterViewWithUpdatedQueryColumns() test to make this behavior explicit.
I also looking into how V1 views behave here and they do the same thing (aka losing the column comments from the original CREATE):

spark-sql (default)> create view iceberg1.v10 (x COMMENT 'comment 1', y COMMENT 'comment 2') AS SELECT count(id), zip from iceberg1.foo group by zip;

spark-sql (default)> show create table iceberg1.v10;
CREATE VIEW iceberg1.v10 (
  x COMMENT 'comment 1',
  y COMMENT 'comment 2')
TBLPROPERTIES (
  'transient_lastDdlTime' = '1706538575')
AS SELECT count(id) AS cnt, zip from iceberg1.foo group by zip

Time taken: 0.055 seconds, Fetched 1 row(s)

spark-sql (default)> describe extended iceberg1.v10;
x                   	bigint              	comment 1           
y                   	int                 	comment 2           
                    	                    	                    
# Detailed Table Information	                    	                    
Catalog             	spark_catalog       	                    
Database            	iceberg1            	                    
Table               	v10                 	                    
Owner               	nastra              	                    
Created Time        	Mon Jan 29 15:29:35 CET 2024	                    
Last Access         	UNKNOWN             	                    
Created By          	Spark 3.5.0         	                    
Type                	VIEW                	                    
View Text           	SELECT count(id) AS cnt, zip from iceberg1.foo group by zip	                    
View Original Text  	SELECT count(id) AS cnt, zip from iceberg1.foo group by zip	                    
View Catalog and Namespace	spark_catalog.default	                    
View Query Output Columns	[cnt, zip]          	                    
Table Properties    	[transient_lastDdlTime=1706538575]	                    
Serde Library       	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	                    
InputFormat         	org.apache.hadoop.mapred.SequenceFileInputFormat	                    
OutputFormat        	org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat	                    
Storage Properties  	[serialization.format=1]	                    
Time taken: 0.063 seconds, Fetched 21 row(s)
spark-sql (default)> alter view iceberg1.v10 AS SELECT count(zip) AS cnt, zip from iceberg1.foo group by zip;
Time taken: 0.213 seconds

spark-sql (default)> show create table iceberg1.v10;
CREATE VIEW iceberg1.v10 (
  cnt,
  zip)
TBLPROPERTIES (
  'transient_lastDdlTime' = '1706538634')
AS SELECT count(zip) AS cnt, zip from iceberg1.foo group by zip

Time taken: 0.029 seconds, Fetched 1 row(s)


spark-sql (default)> describe extended iceberg1.v10;
cnt                 	bigint              	                    
zip                 	int                 	                    
                    	                    	                    
# Detailed Table Information	                    	                    
Catalog             	spark_catalog       	                    
Database            	iceberg1            	                    
Table               	v10                 	                    
Owner               	nastra              	                    
Created Time        	Mon Jan 29 15:29:35 CET 2024	                    
Last Access         	UNKNOWN             	                    
Created By          	Spark 3.5.0         	                    
Type                	VIEW                	                    
View Text           	SELECT count(zip) AS cnt, zip from iceberg1.foo group by zip	                    
View Original Text  	SELECT count(zip) AS cnt, zip from iceberg1.foo group by zip	                    
View Catalog and Namespace	spark_catalog.default	                    
View Query Output Columns	[cnt, zip]          	                    
Table Properties    	[transient_lastDdlTime=1706538634]	                    
Serde Library       	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	                    
InputFormat         	org.apache.hadoop.mapred.SequenceFileInputFormat	                    
OutputFormat        	org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat	                    
Storage Properties  	[serialization.format=1]	                    
Time taken: 0.06 seconds, Fetched 21 row(s)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarize this topic, we discussed this offline and decided that we should not support ALTER VIEW ... AS for now and rather throw an exception (indicating to use CREATE OR REPLACE VIEW), as it leads to surprising behaviors that column aliases/comments are lost after updating the underlying SQL.


assertThatThrownBy(() -> sql("ALTER VIEW %s AS %s", viewName, sql))
.isInstanceOf(AnalysisException.class)
.hasMessageContaining("Cannot create the persistent object")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error message in Spark unfortunately doesn't distinguish between create/alter cases and throws that error for V1 views as well

@nastra nastra force-pushed the spark-view-alter-support branch from 907a5f1 to 5c9bf1b Compare January 26, 2024 17:29
@@ -123,4 +132,17 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi

collectTempViews(child)
}

private object IsIcebergView {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Is for this name doesn't make sense because it doesn't produce a boolean. It produces a loaded v2 view. I don't even think that the v2 view needs to be an Iceberg view.

I think this should probably be renamed to ResolvedView and the other should be named ResolvedIdentifier or something. I also don't like the duplication across these two helpers. It seems like this is doing what the other rule does, but should instead be delegating to other rules.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, could this unapply method be added to ResolvedView?

Copy link
Contributor Author

@nastra nastra Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like those new names better. I've updated the existing one to ResolvedIdent as otherwise it clashes with the existing ResolvedIdentifier. Unfortunately both unapply methods can't be combined into ResolvedView

@@ -60,17 +63,23 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi
properties = properties,
allowExisting = allowExisting,
replace = replace)

case a@AlterViewAs(ResolvedV2View(_, ident, _), _, query) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that this depends on the next case running before this one. I think that I would combine the two.

I think this should be:

    case a@AlterViewAs(ResolvedView(resolved), _, query) =>
      val q = CTESubstitution.apply(query)
      verifyTemporaryObjectsDontExist(ident, q)
      a.copy(child=resolved, query = q)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to go this route, then we'd also need to add the same handling for SetViewProperties(UnresolvedView(...)) / UnsetViewProperties(UnresolvedView(...)), hence it seemed better to me to just handle case u@UnresolvedView(ResolvedView(resolved), _, _, _) separately.
I'm fine either way, so let me know if you'd still like me to switch to the approach you proposed.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good for the properties cases, but I think we have a major question around the ALTER VIEW ... AS case. @nastra, can you separate those two implementations? There's no reason to mix them together and we can move forward with the view property updates quickly.

@nastra
Copy link
Contributor Author

nastra commented Jan 29, 2024

@rdblue I have extracted setting/unsetting view properties into #9582

@nastra nastra force-pushed the spark-view-alter-support branch 2 times, most recently from 5c36b4d to 9c6bcf1 Compare January 30, 2024 17:39
@nastra nastra changed the title Spark: Support altering views Spark: Throw exception on ALTER VIEW <viewName> AS <query> Jan 30, 2024
@@ -60,6 +62,9 @@ case class RewriteViewCommands(spark: SparkSession) extends Rule[LogicalPlan] wi
properties = properties,
allowExisting = allowExisting,
replace = replace)

case AlterViewAs(ResolvedV2View(_, _), _, _) =>
throw new AnalysisException("ALTER VIEW <viewName> AS is not supported. Use CREATE OR REPLACE VIEW instead")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to happen here instead of in CheckViews so that Spark doesn't throw a different analysis exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I threw the error here mainly because when we supported AlterViewAs, most of the things had to happen in RewriteViewCommands but I've moved this now to CheckViews

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Throwing an exception in RewriteViewCommands is a little suspicious, but I think that's to avoid Spark throwing exceptions in rules rather than in the validation phase so it's okay.

`ALTER VIEW <viewName> AS <query>` doesn't allow to preserve column
aliases and column comment when the underlying query is changed, which
can lead to unexpected behavior. For now it's better to use `CREATE OR REPLACE VIEW`
as that is more explicit when the schema of the view is defined with
column aliases/comments.
@nastra nastra force-pushed the spark-view-alter-support branch from 9c6bcf1 to b646058 Compare February 1, 2024 08:21
@nastra nastra merged commit 66d4cf6 into apache:main Feb 1, 2024
31 checks passed
@nastra nastra deleted the spark-view-alter-support branch February 1, 2024 09:11
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
…9510)

`ALTER VIEW <viewName> AS <query>` doesn't allow to preserve column
aliases and column comment when the underlying query is changed, which
can lead to unexpected behavior. For now it's better to use `CREATE OR REPLACE VIEW`
as that is more explicit when the schema of the view is defined with
column aliases/comments.
@nastra nastra self-assigned this Nov 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants