feat: temporarily disconnect metadata db during long analytics queries #31315
base: master
Conversation
superset/config.py
Outdated
# connection temporarily during the execution of analytics queries, avoiding bottlenecks
"DISABLE_METADATA_DB_DURING_ANALYTICS": False,
# if true, we add a one minute delay to the database queries for both Explore and SQL Lab
"SIMULATE_SLOW_ANALYTICS_DATABASE": True,
switch to False?
superset/models/core.py
Outdated
df = mutator(df)
with temporarily_disconnect_db():
    if is_feature_enabled("SIMULATE_SLOW_ANALYTICS_DATABASE"):
        time.sleep(30)
nit: it would be nice to have a configuration key for this value as well, or to just remove this functionality and rely on other types of tests, for example using charts that query pg_sleep on a Postgres analytics DB
Yeah, I wanted it to be set as a value but also be dynamic (no need to deploy to change it). The feature flag framework we have now only accepts bools, and configs can't be changed on the fly...
I might just need to strip it out of this PR. Ideally we'd have a warm configs/settings framework isolated from feature flags.
I do think it's a bit squicky to have a production feature flag which inserts a sleep to set up a test scenario most people won't care about. I get that it makes testing easier, but it'd probably be better just as a set of test instructions, especially if you can test it with a deliberately slow query like @dpgaspar suggested.
I agree with @giftig - adding this type of testing logic is IMO not a great solution, as it complicates the codebase and adds maintenance burden that doesn't contribute to the core functionality of the product.
Yeah, it's funky; removing it. I meant to open this PR as a DRAFT, let me switch it. Also, having this in multiple places isn't great.
Also found SELECT pg_sleep(), {...} as a different/better way to test this feature.
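For reference, the pg_sleep approach can be wrapped in a tiny helper like the one below (a sketch with a hypothetical name, not part of the PR). pg_sleep(n) blocks server-side for n seconds on PostgreSQL, so from Superset's point of view the query looks exactly like a slow analytics query, with no test-only code in the app:

```python
def make_slow_query(seconds: int, projection: str = "1 AS ok") -> str:
    """Build a PostgreSQL query that blocks for `seconds` before returning.

    pg_sleep() runs on the database server, so pasting the result into
    SQL Lab or a chart simulates a slow analytics database end to end.
    """
    return f"SELECT pg_sleep({seconds}), {projection}"

# e.g. paste into SQL Lab against a Postgres analytics database:
# SELECT pg_sleep(30), 1 AS ok
```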
@@ -691,27 +731,30 @@ def _log_query(sql: str) -> None:
with self.get_raw_connection(catalog=catalog, schema=schema) as conn:
WDYT about introducing the temporarily_disconnect_db logic inside get_raw_connection? If possible, it could bring several benefits: it would make sure that all analytics DB queries obey the disconnect, and it would be easier to refactor in the future.
That was my original plan, but some code like Database.get_df() doesn't use get_raw_connection. I also looked at get_sqla_engine and other areas in DbEngineSpec. There's a fair amount of indirection and a deep, conditional call stack around analytics queries... One thing I found is that with context managers like Database.get_sqla_engine, we offer a lease on a connection, and there are no guarantees that the code using it won't need a metadata connection...
It would be great to refactor all the analytics DB access to go through a single focal point. I'd say the logic related to external data access in core.models.Database and in sql_lab.py and elsewhere should be brought into DbEngineSpec. There we can make a lot of the logic private to that package and have the rest of the codebase use higher-level abstractions like DbEngineSpec.get_df and DbEngineSpec.get_data.
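To make the proposal concrete, here's a hypothetical sketch of that shape (none of these signatures exist in the PR or in Superset today; `run_query`, `cache_manager`, and `on_progress` are illustrative stand-ins): all analytics access funnels through one classmethod, with Superset-side concerns injected as collaborators rather than imported directly.

```python
from typing import Any, Callable, Optional


class BaseEngineSpec:
    @classmethod
    def get_df(
        cls,
        run_query: Callable[[str], list[tuple[Any, ...]]],  # stand-in for the real analytics DB call
        sql: str,
        cache_manager: Optional[dict] = None,                 # injected, not imported from the app
        on_progress: Optional[Callable[[int], None]] = None,  # e.g. a callback updating the Query row
    ) -> list[tuple[Any, ...]]:
        # Serve from cache when possible, keeping cache policy outside the spec.
        if cache_manager is not None and sql in cache_manager:
            return cache_manager[sql]
        if on_progress:
            on_progress(0)
        rows = run_query(sql)
        if on_progress:
            on_progress(100)
        if cache_manager is not None:
            cache_manager[sql] = rows
        return rows
```

The point of the injection style is that the db_engine_spec package never imports the cache manager or the metadata models itself, which is what would let the rest of the analytics plumbing become private to that package.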
WDYT about introducing the temporarily_disconnect_db logic inside get_raw_connection
Unfortunately I don't think that it is possible, given that get_raw_connection is a context manager, and that it's possible for the caller to use the metadata database in the context provided by the context manager...
Ok, so I removed the nasty sleep.
On our side we're going to deploy this to a staging environment to run extensive tests around it. Not sure if it's mergeable as is, or whether we want to keep this out. Given it's a noop, that I commit to doing the larger refactor, and that it could be useful to more people in the community, I'd advocate to try and get this merged. But happy to just cherry-pick it as is on our side if we think it should be chained behind the refactor.
(force-pushed from 380acdf to 43d37d2, then from 43d37d2 to 744026b)
The number of connections to the metadata database (in our case Postgres) is limited, and when a lot of pods autoscale we can hit the maximum number of Postgres connections, at which point new pods/requests can't get a connection. Typically this happens when a database like Redshift or Presto is queuing up and requests to the analytics database are hanging. People get impatient and force-refresh their dashboards, which makes it even worse, piling up lots of web threads that are all just waiting on the analytics DB, each one hogging a metadata database connection.
This PR adds a new feature flag (false by default), DISABLE_METADATA_DB_DURING_ANALYTICS, that uses a context manager to disconnect and reconnect to the metadata database during long blocking operations, like waiting for an analytics query to return a result. This works only in conjunction with NullPool for now, but could be extended to work with various pool configurations.
Also, for convenience, this adds another feature flag, SIMULATE_SLOW_ANALYTICS_DATABASE, that introduces a 60-second sleep to make it easier to test this.
Note that net-net this PR is a no-op with the default feature flags.
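The disconnect/reconnect behavior can be modeled with a small stdlib-only toy (this is an illustration, not the PR's implementation; LazyConnection is a made-up stand-in for a NullPool-backed metadata DB handle, where every checkout opens a fresh connection, so "disconnecting" reduces to closing whatever we currently hold and reconnecting lazily afterwards):

```python
import sqlite3
from contextlib import contextmanager


class LazyConnection:
    """Toy stand-in for a NullPool-backed metadata DB handle: it can be
    closed at any point and transparently reopens on the next use."""

    def __init__(self, dsn: str = ":memory:"):
        self.dsn = dsn
        self._conn = None

    def cursor(self):
        if self._conn is None:  # lazy (re)connect, as NullPool would
            self._conn = sqlite3.connect(self.dsn)
        return self._conn.cursor()

    def close(self):
        if self._conn is not None:
            self._conn.close()
            self._conn = None


metadata_db = LazyConnection()


@contextmanager
def temporarily_disconnect_db(db: LazyConnection = metadata_db):
    db.close()  # release the metadata connection while we block
    try:
        yield   # the long analytics query runs here
    finally:
        pass    # nothing to do: reconnection happens lazily on next use


with temporarily_disconnect_db():
    pass  # e.g. waiting on a queued Redshift/Presto query

# The next metadata query reconnects transparently:
assert metadata_db.cursor().execute("SELECT 1").fetchone() == (1,)
```

The key property the real feature relies on is the last line: because NullPool has no persistent pool state, nothing special is needed to come back, the next checkout simply opens a new connection.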
One question is whether we scrap this PR and chain it behind a large refactor, call it "untangle and centralize all analytics database interaction in the superset/db_engine_spec/ package". The goal there would be to move all analytics database operations into one place. One issue is that database interactions are somewhat tangled with Superset-specific logic: things like interaction with the cache manager, updating progress/status in the Query table, ... I think the way we'd handle it is by passing objects like the cache_manager to db_engine_spec, and even callbacks if needed.