feat: temporarily disconnect metadata db during long analytics queries #31315

Open · wants to merge 4 commits into base: master
Conversation

@mistercrunch (Member) commented Dec 5, 2024

The number of connections to the metadata database, in our case Postgres, is limited, and when a lot of pods autoscale we can hit the maximum number of Postgres connections, at which point new pods/requests can't get a connection. Typically this happens when a database like Redshift or Presto is queuing up and requests to the analytics database are hanging. People get impatient and force-refresh their dashboards, which makes it even worse, piling up lots of web threads that are all just waiting on the analytics db while each one hogs a metadata database connection.

This PR adds a new feature flag (false by default), DISABLE_METADATA_DB_DURING_ANALYTICS, that uses a context manager to disconnect from and reconnect to the metadata database around long blocking operations, like waiting for an analytics query to return a result. For now this only works in conjunction with NullPool, but it could be extended to support other pool configurations.
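For illustration, here is a minimal sketch of what such a context manager might look like. The name mirrors the diff, but the session/engine handling below is an assumption, not the PR's actual implementation:

```python
from contextlib import contextmanager

from superset import db, is_feature_enabled


@contextmanager
def temporarily_disconnect_db():
    """Release the metadata db connection while a long analytics call runs.

    Sketch only: assumes the metadata database is configured with NullPool,
    so disposing the engine truly closes the connection instead of returning
    it to a pool, and the next session use opens a fresh one.
    """
    if not is_feature_enabled("DISABLE_METADATA_DB_DURING_ANALYTICS"):
        yield
        return
    db.session.commit()   # flush any pending metadata work
    db.session.close()    # release the session's connection
    db.engine.dispose()   # with NullPool, this closes the connection for real
    try:
        yield             # the long-running analytics operation happens here
    finally:
        # Nothing explicit to do: the next metadata db access transparently
        # re-establishes a connection.
        pass
```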

Also, for convenience, this adds another feature flag, SIMULATE_SLOW_ANALYTICS_DATABASE, that introduces a 60-second sleep to make this easier to test.

Note that, net-net, this PR is a no-op with the default feature flags.
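For context, enabling it in a deployment would look roughly like this in superset_config.py (the NullPool requirement is per the paragraph above; using Flask-SQLAlchemy's SQLALCHEMY_ENGINE_OPTIONS for the pool class is an assumption about how the metadata db is configured):

```python
# superset_config.py (sketch)
from sqlalchemy.pool import NullPool

FEATURE_FLAGS = {
    "DISABLE_METADATA_DB_DURING_ANALYTICS": True,
}

# The disconnect logic currently only works with NullPool on the metadata db,
# so that disposing the engine really closes the connection.
SQLALCHEMY_ENGINE_OPTIONS = {"poolclass": NullPool}
```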

One question is whether we should scrap this PR and chain it behind a larger refactor, call it "untangle and centralize all analytics database interaction in the superset/db_engine_spec/ package". The goal there would be to move all analytics database operations into one place. One issue is that database interactions are somewhat tangled with Superset-specific logic: things like interaction with the cache manager, updating progress/status in the Query table, and so on. I think the way we'd handle it is by passing objects like the cache_manager into db_engine_spec, and even callbacks if needed.

# connection temporarily during the execution of analytics queries, avoiding bottlenecks
"DISABLE_METADATA_DB_DURING_ANALYTICS": False,
# if true, we add a one minute delay to the database queries for both Explore and SQL Lab
"SIMULATE_SLOW_ANALYTICS_DATABASE": True,
Member:
switch to False?

df = mutator(df)
with temporarily_disconnect_db():
if is_feature_enabled("SIMULATE_SLOW_ANALYTICS_DATABASE"):
time.sleep(30)
Member:
nit: it would be nice to have a configuration key for this value too, or just remove this functionality and rely on other types of tests, for example using charts that run pg_sleep on a PG analytics db

Member Author:
Yeah, I wanted it to be set as a value but also be dynamic (no need to deploy to change it). The feature flag framework we have now only accepts bool, and configs can't be changed on the fly...

I might just need to strip it out of this PR. Ideally we'd have a warm configs/settings framework isolated from feature flags.

Contributor:
I do think it's a bit squicky to have a production feature flag which inserts a sleep to set up a test scenario most people won't care about. I get that it makes testing easier, but it'd probably be better just as a set of test instructions, especially if you can test it with a deliberately slow query like @dpgaspar suggested.

Member:
I agree with @giftig - adding this type of testing logic is IMO not a great solution, as it convolutes the codebase and adds maintenance burden that doesn't contribute to the core functionality of the product.

Member Author:
Yeah, it's funky; removing it. I meant to issue this PR as a DRAFT, let me switch it. Also, having this in multiple places isn't great.

Member Author:
Also found SELECT pg_sleep(), {...} as a different/better way to test this feature.

@@ -691,27 +731,30 @@ def _log_query(sql: str) -> None:
with self.get_raw_connection(catalog=catalog, schema=schema) as conn:
Member:
WDYT about introducing the temporarily_disconnect_db logic inside get_raw_connection, if possible? It could bring several benefits, for example making sure that all analytics db queries obey the disconnect, and it would be easier to refactor in the future.

Member Author:
That was my original plan, but some code like Database.get_df() doesn't use get_raw_connection. I also looked at get_sqla_engine and other areas in DbEngineSpec. There's a fair amount of indirection and a deep, conditional call stack around analytics queries... One thing I found is that with context managers like Database.get_sqla_engine, we offer a lease on a connection and there's no guarantee that the code using it won't need a metadata connection...

@mistercrunch (Member Author), Dec 6, 2024:
It would be great to refactor all the analytics DB access to go through a single focal point. I'd say the logic related to external data access in core.models.Database, in sql_lab.py, and elsewhere should be brought into DbEngineSpec. There we can make a lot of the logic private to that package and have the rest of the codebase use higher-level abstractions like DbEngineSpec.get_df and DbEngineSpec.get_data.
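As a rough illustration of that shape (method names and signatures here are hypothetical, not existing Superset APIs):

```python
# Hypothetical focal point for all analytics db access; names are illustrative.
class BaseEngineSpec:
    @classmethod
    def get_df(cls, database, sql, cache_manager=None, on_progress=None):
        """Run an analytics query and return a DataFrame, keeping the
        Superset-specific concerns (caching, progress callbacks, metadata db
        disconnects) inside this package."""
        # reuses the temporarily_disconnect_db context manager from this PR
        with temporarily_disconnect_db():
            if on_progress:
                on_progress(0)
            df = database.get_df(sql)
        if cache_manager is not None:
            cache_manager.set(sql, df)
        return df
```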

Member Author:
WDYT about introducing the temporarily_disconnect_db logic inside get_raw_connection

Unfortunately I don't think that's possible, given that get_raw_connection is a context manager, and it's possible for the caller to use the metadata database inside the context provided by the context manager...
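To make the concern concrete, here is a hypothetical caller (the query object and its status update are stand-ins for the sql_lab-style code that uses this connection):

```python
from superset import db


def execute_query(database, query, sql, catalog=None, schema=None):
    # If temporarily_disconnect_db() lived inside get_raw_connection(), this
    # whole block would run with the metadata db disconnected, yet the caller
    # legitimately needs it before the block exits:
    with database.get_raw_connection(catalog=catalog, schema=schema) as conn:
        cursor = conn.cursor()
        cursor.execute(sql)           # the long analytics query
        data = cursor.fetchall()
        query.status = "success"      # metadata db write on the Query model
        db.session.commit()           # needs a live metadata db connection
    return data
```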

@mistercrunch (Member Author) commented Dec 6, 2024

Ok, so I removed the nasty SIMULATE_SLOW_ANALYTICS_DATABASE feature flag, and I'm thinking ideally we need to put in a significant refactor around what I wrote before ->

One question is whether we should scrap this PR and chain it behind a larger refactor, call it "untangle and centralize all analytics database interaction in the superset/db_engine_spec/ package". The goal there would be to move all analytics database operations into one place. One issue is that database interactions are somewhat tangled with Superset-specific logic: things like interaction with the cache manager, updating progress/status in the Query table, and so on. I think the way we'd handle it is by passing objects like the cache_manager into db_engine_spec, and even callbacks if needed.

@mistercrunch (Member Author):
On our side we're going to deploy this to a staging environment and run extensive tests around it. Not sure if it's mergeable as is, or whether we want to keep it out of master until the larger refactor I mentioned.

Given that it's a no-op, that I commit to doing the larger refactor, and that it could be useful to more people in the community, I'd advocate trying to get this merged. But I'm happy to just cherry-pick it as is on our side if we think it should be chained behind the refactor.

@mistercrunch mistercrunch force-pushed the sqla-close branch 2 times, most recently from 380acdf to 43d37d2 on December 9, 2024 19:41
@mistercrunch mistercrunch marked this pull request as ready for review December 9, 2024 19:44
@sadpandajoe sadpandajoe requested a review from eschutho December 9, 2024 20:02
Labels: change:backend, data:connect:postgres, preset-io, size/M

4 participants