Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ConnectorMetadata#isMaterializedView #18933

Closed
wants to merge 5 commits into from

Conversation

ksobolew
Copy link
Contributor

@ksobolew ksobolew commented Sep 5, 2023

Description

The current implementation of MetadataManager#isMaterializedView calls ConnectorMetadata#getMaterializedView under the covers. This means that we have to load and deserialize the materialized view metadata just to see if it exists, which is unnecessary most of the time. This may also cause issues when the metadata is currupted in some way (see the last commit for relevant test cases, where we add support for this method to the Iceberg plugin). In such a case, we want to still be able to DROP the corrupted materialized view - which is currently impossible, because the DropMaterializedViewTask will call getMaterializedView, which will fail when it tries to load the corrupted metadata. With this change this is less likely.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Fix failure in `DROP MATERIALIZED VIEW` in Iceberg when the materialized view metadata is missing or corrupted.

@cla-bot cla-bot bot added the cla-signed label Sep 5, 2023
@ksobolew ksobolew marked this pull request as ready for review September 5, 2023 13:49
@github-actions github-actions bot added tests:hive hive Hive connector labels Sep 5, 2023
@github-actions github-actions bot added the jdbc Relates to Trino JDBC driver label Sep 6, 2023
@ksobolew
Copy link
Contributor Author

ksobolew commented Sep 6, 2023

CI #16437

@github-actions github-actions bot added the iceberg Iceberg connector label Sep 7, 2023
@ksobolew ksobolew requested review from findepi and hashhar September 12, 2023 14:25
@findepi
Copy link
Member

findepi commented Sep 13, 2023

And that would prevent it from being DROPped, because the DropMaterializedViewTask first wants to check if the entity to be dropped is a materialized view. If it called getMaterializedView, as it used to, dropping such a view would fail too; but with isMaterializedView that is not a problem.

please add a test for that

This may not be just an optimization,

is this also an optimization? if so, there should be a test change (eg counting file system or metastore access) that shows the improvement.

@findepi
Copy link
Member

findepi commented Sep 13, 2023

(x) This is not user-visible or is docs only, and no release notes are required.

The description contradicts that.

@ksobolew
Copy link
Contributor Author

is this also an optimization? if so, there should be a test change (eg counting file system or metastore access) that shows the improvement.

There are already changes to the TestTrinoDatabaseMetaData that show changes in the call patterns. It's not exactly fewer calls, but fewer to getMaterializedView.

@findepi
Copy link
Member

findepi commented Sep 15, 2023

TestTrinoDatabaseMetaData is quite tied to JDBC driver
can we have a test for Iceberg connector in the Iceberg module as well?
I think it would make sense, given that Iceberg is the only connector that has MV support

see eg

@Test(dataProvider = "metadataQueriesTestTableCountDataProvider")
public void testInformationSchemaColumns(int tables)
{
String schemaName = "test_i_s_columns_schema" + randomNameSuffix();
assertUpdate("CREATE SCHEMA " + schemaName);
Session session = Session.builder(getSession())
.setSchema(schemaName)
.build();
for (int i = 0; i < tables; i++) {
assertUpdate(session, "CREATE TABLE test_select_i_s_columns" + i + "(id varchar, age integer)");
// Produce multiple snapshots and metadata files
assertUpdate(session, "INSERT INTO test_select_i_s_columns" + i + " VALUES ('abc', 11)", 1);
assertUpdate(session, "INSERT INTO test_select_i_s_columns" + i + " VALUES ('xyz', 12)", 1);
assertUpdate(session, "CREATE TABLE test_other_select_i_s_columns" + i + "(id varchar, age integer)"); // won't match the filter
}
// Bulk retrieval
assertMetastoreInvocations(session, "SELECT * FROM information_schema.columns WHERE table_schema = CURRENT_SCHEMA AND table_name LIKE 'test_select_i_s_columns%'",
ImmutableMultiset.builder()
.add(GET_ALL_TABLES_FROM_DATABASE)
.addCopies(GET_TABLE, tables * 2)
.addCopies(GET_TABLES_WITH_PARAMETER, 2)
.build());
// Pointed lookup
assertMetastoreInvocations(session, "SELECT * FROM information_schema.columns WHERE table_schema = CURRENT_SCHEMA AND table_name = 'test_select_i_s_columns0'",
ImmutableMultiset.builder()
.add(GET_TABLE)
.build());
// Pointed lookup via DESCRIBE (which does some additional things before delegating to information_schema.columns)
assertMetastoreInvocations(session, "DESCRIBE test_select_i_s_columns0",
ImmutableMultiset.builder()
.add(GET_DATABASE)
.add(GET_TABLE)
.build());
for (int i = 0; i < tables; i++) {
assertUpdate(session, "DROP TABLE test_select_i_s_columns" + i);
assertUpdate(session, "DROP TABLE test_other_select_i_s_columns" + i);
}
}

@ksobolew
Copy link
Contributor Author

Sure, I'm working on tests right now (though demonstrating how a corrupted metadata can prevent DROP MV from executing is kinda challenging)

The current implementation of `MetadataManager#isMaterializedView` calls
`ConnectorMetadata#getMaterializedView` under the covers. This means
that we have to load and deserialize the materialized view metadata just
to see if it exists, which is unnecessary most of the time. This may
also cause issues when the metadata is currupted in some way (see
subsequent commits for relevant test cases, where we add support for
this method to the Iceberg plugin). In such a case, we want to still be
able to `DROP` the corrupted materialized view - which is currently
impossible, because the `DropMaterializedViewTask` will call
`getMaterializedView`, which will fail when it tries to load the
corrupted metadata. With this change this is less likely.
Here it's mostly just an optimization, which allows us to avoid loading
the materialized view metadata unnecessarily in some situations.
@ksobolew
Copy link
Contributor Author

@findepi Addressed comments, PTAL.

@ksobolew
Copy link
Contributor Author

TestTrinoDatabaseMetaData is quite tied to JDBC driver
can we have a test for Iceberg connector in the Iceberg module as well?
I think it would make sense, given that Iceberg is the only connector that has MV support

Added such a test, but it's not very meaningful, IMO. It just says that we call getTable on the metastore one time less when we DROP MATERIALIZED VIEW.

@ksobolew
Copy link
Contributor Author

(x) This is not user-visible or is docs only, and no release notes are required.

The description contradicts that.

Added a release note.

@ksobolew
Copy link
Contributor Author

ksobolew commented Sep 28, 2023

And that would prevent it from being DROPped, because the DropMaterializedViewTask first wants to check if the entity to be dropped is a materialized view. If it called getMaterializedView, as it used to, dropping such a view would fail too; but with isMaterializedView that is not a problem.

please add a test for that

Finally found a way to do that; test added.

This may not be just an optimization,

is this also an optimization? if so, there should be a test change (eg counting file system or metastore access) that shows the improvement.

I rewrote the commit message (and the PR description) to phrase it as primarily a correctness issue, with some efficiency improvements added as a bonus.

@ksobolew
Copy link
Contributor Author

See #19177 for an example of a failure that this PR can mitigate.

This commit adds, among other things, test cases where we `DROP`
materialized view with corrupted metadata, which would otherwise fail.

In case of Hive catalogs we can bypass the metadata cache entirely, in
case of Glue catalogs the benefits are not that pronounced. But we still
avoid some potentially expensive operations and/or ones potentially
causing failures.
@findepi
Copy link
Member

findepi commented Sep 29, 2023

/test-with-secrets sha=f5d848b3678109c4fe1e67b18b798d5c0e1ee04e

@findepi
Copy link
Member

findepi commented Sep 29, 2023

i plan to review this (later)

@github-actions
Copy link

github-actions bot commented Sep 29, 2023

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/6352010690

@findepi
Copy link
Member

findepi commented Sep 29, 2023

This means that we have to load and deserialize the materialized view metadata just to see if it exists, which is unnecessary most of the time

This is true, but actually it is probably much lesser problem than it sounds it is

. This may also cause issues when the metadata is currupted in some way (see the last commit for relevant test cases, where we add support for this method to the Iceberg plugin). In such a case, we want to still be able to DROP the corrupted materialized view

The ConnectorMaterializedViewDefinition is the "materialized view definition", it shouldn't need to load the storage table for anything.
I think the fact that in Iceberg case it loads properties from the storage table is sub-optimal design and we should revisit it.

This is important not only for DROP corrupted_mv case.
It is also important for all metadata queries. Querying system.metatda.materialized_views should ideally not need to go to S3 to obtain "materialized view definitions", especially not O(number of MVs) times.

@ksobolew
Copy link
Contributor Author

ksobolew commented Oct 2, 2023

This is true, but actually it is probably much lesser problem than it sounds it is

It is a problem, at least, for the use case I'm trying to fix.

This is important not only for DROP corrupted_mv case.
It is also important for all metadata queries. Querying system.metatda.materialized_views should ideally not need to go to S3 to obtain "materialized view definitions", especially not O(number of MVs) times.

That's true as well.

I appreciate the feedback and I agree that we don't want to add more code which may be considered a workaround for limitations of the current architecture, but I also don't want this to become a case of "sure, but first please spend 6 months refactoring this complex piece of code". What would be the commended course of action then?

Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jan 11, 2024
@mosabua
Copy link
Member

mosabua commented Jan 11, 2024

Reminder to continue work on this @ksobolew @findepi or close it.

@ksobolew
Copy link
Contributor Author

This work is currently on hold and it's not clear this is the way forward. I'll reopen if it crystalizes.

@ksobolew ksobolew closed this Jan 12, 2024
@ksobolew ksobolew deleted the kudi/is-mv branch May 20, 2024 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector iceberg Iceberg connector jdbc Relates to Trino JDBC driver stale
Development

Successfully merging this pull request may close these issues.

3 participants