feat: temporarily disconnect metadata db during long analytics queries #31315

Open · wants to merge 4 commits into base: master
Conversation

@mistercrunch (Member) commented Dec 5, 2024

The number of connections to the metadata database, in our case Postgres, is limited, and when a lot of pods autoscale we can hit the maximum number of Postgres connections, at which point new pods/requests can't get a connection. Typically this happens when a database like Redshift or Presto is queuing up and requests to the analytics database are hanging. People get impatient and force-refresh their dashboards, which makes it even worse, piling up lots of web threads that are all just waiting on the analytics db while each one hogs a metadata database connection.

This PR adds a new feature flag (false by default), DISABLE_METADATA_DB_DURING_ANALYTICS, that uses a context manager to disconnect from and reconnect to the metadata database around long blocking operations, like waiting for an analytics query to return a result. For now this only works in conjunction with NullPool, but it could be extended to support other pool configurations.
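For illustration, here is a minimal sketch of what such a context manager might look like. The name mirrors the diff, but the session/engine handling below is an assumption, not the PR's actual implementation:

```python
from contextlib import contextmanager

from superset import db, is_feature_enabled


@contextmanager
def temporarily_disconnect_db():
    """Release the metadata db connection while a long analytics call runs.

    Sketch only: assumes the metadata database is configured with NullPool,
    so disposing the engine truly closes the connection instead of returning
    it to a pool, and the next session use opens a fresh one.
    """
    if not is_feature_enabled("DISABLE_METADATA_DB_DURING_ANALYTICS"):
        yield
        return
    db.session.commit()   # flush any pending metadata work
    db.session.close()    # release the session's connection
    db.engine.dispose()   # with NullPool, this closes the connection for real
    try:
        yield             # the long-running analytics operation happens here
    finally:
        # Nothing explicit to do: the next metadata db access transparently
        # re-establishes a connection.
        pass
```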

Also, for convenience, this adds another feature flag, SIMULATE_SLOW_ANALYTICS_DATABASE, that introduces a 60-second sleep to make this easier to test.

Note that, net-net, this PR is a no-op with the default feature flags.
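For context, enabling it in a deployment would look roughly like this in superset_config.py (the NullPool requirement is per the paragraph above; using Flask-SQLAlchemy's SQLALCHEMY_ENGINE_OPTIONS for the pool class is an assumption about how the metadata db is configured):

```python
# superset_config.py (sketch)
from sqlalchemy.pool import NullPool

FEATURE_FLAGS = {
    "DISABLE_METADATA_DB_DURING_ANALYTICS": True,
}

# The disconnect logic currently only works with NullPool on the metadata db,
# so that disposing the engine really closes the connection.
SQLALCHEMY_ENGINE_OPTIONS = {"poolclass": NullPool}
```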

One question is whether we should scrap this PR and chain it behind a larger refactor, call it "untangle and centralize all analytics database interaction in the superset/db_engine_spec/ package". The goal there would be to move all analytics database operations into one place. One issue is that database interactions are somewhat tangled with Superset-specific logic: things like interaction with the cache manager, updating progress/status in the Query table, and so on. I think the way we'd handle it is by passing objects like the cache_manager into db_engine_spec, and even callbacks if needed.

# connection temporarily during the execution of analytics queries, avoiding bottlenecks
"DISABLE_METADATA_DB_DURING_ANALYTICS": False,
# if true, we add a one minute delay to the database queries for both Explore and SQL Lab
"SIMULATE_SLOW_ANALYTICS_DATABASE": True,
Member:
switch to False?

df = mutator(df)
with temporarily_disconnect_db():
if is_feature_enabled("SIMULATE_SLOW_ANALYTICS_DATABASE"):
time.sleep(30)
Member:
nit: it would be nice to have a configuration key for this value too, or just remove this functionality and rely on other types of tests, for example using charts that run pg_sleep on a PG analytics db

Member Author:
Yeah, I wanted it to be set as a value but also be dynamic (no need to deploy to change it). The feature flag framework we have now only accepts bool, and configs can't be changed on the fly...

I might just need to strip it out of this PR. Ideally we'd have a warm configs/settings framework isolated from feature flags.

Contributor:
I do think it's a bit squicky to have a production feature flag which inserts a sleep to set up a test scenario most people won't care about. I get that it makes testing easier, but it'd probably be better just as a set of test instructions, especially if you can test it with a deliberately slow query like @dpgaspar suggested.

Member:
I agree with @giftig - adding this type of testing logic is IMO not a great solution, as it convolutes the codebase and adds maintenance burden that doesn't contribute to the core functionality of the product.

Member Author:
Yeah, it's funky; removing it. I meant to issue this PR as a DRAFT, let me switch it. Also, having this in multiple places isn't great.

Member Author:
Also found SELECT pg_sleep(), {...} as a different/better way to test this feature.

@@ -691,27 +731,30 @@ def _log_query(sql: str) -> None:
with self.get_raw_connection(catalog=catalog, schema=schema) as conn:
Member:
WDYT about introducing the temporarily_disconnect_db logic inside get_raw_connection, if possible? It could bring several benefits, for example making sure that all analytics db queries obey the disconnect, and it would be easier to refactor in the future.

Member Author:
That was my original plan, but some code like Database.get_df() doesn't use get_raw_connection. I also looked at get_sqla_engine and other areas in DbEngineSpec. There's a fair amount of indirection and a deep, conditional call stack around analytics queries... One thing I found is that with context managers like Database.get_sqla_engine, we offer a lease on a connection and there's no guarantee that the code using it won't need a metadata connection...

@mistercrunch (Member Author), Dec 6, 2024:
It would be great to refactor all the analytics DB access to go through a single focal point. I'd say the logic related to external data access in core.models.Database, in sql_lab.py, and elsewhere should be brought into DbEngineSpec. There we can make a lot of the logic private to that package and have the rest of the codebase use higher-level abstractions like DbEngineSpec.get_df and DbEngineSpec.get_data.
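As a rough illustration of that shape (method names and signatures here are hypothetical, not existing Superset APIs):

```python
# Hypothetical focal point for all analytics db access; names are illustrative.
class BaseEngineSpec:
    @classmethod
    def get_df(cls, database, sql, cache_manager=None, on_progress=None):
        """Run an analytics query and return a DataFrame, keeping the
        Superset-specific concerns (caching, progress callbacks, metadata db
        disconnects) inside this package."""
        # reuses the temporarily_disconnect_db context manager from this PR
        with temporarily_disconnect_db():
            if on_progress:
                on_progress(0)
            df = database.get_df(sql)
        if cache_manager is not None:
            cache_manager.set(sql, df)
        return df
```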

Member Author:
WDYT about introducing the temporarily_disconnect_db logic inside get_raw_connection

Unfortunately I don't think that's possible, given that get_raw_connection is a context manager, and it's possible for the caller to use the metadata database inside the context provided by the context manager...
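To make the concern concrete, here is a hypothetical caller (the query object and its status update are stand-ins for the sql_lab-style code that uses this connection):

```python
from superset import db


def execute_query(database, query, sql, catalog=None, schema=None):
    # If temporarily_disconnect_db() lived inside get_raw_connection(), this
    # whole block would run with the metadata db disconnected, yet the caller
    # legitimately needs it before the block exits:
    with database.get_raw_connection(catalog=catalog, schema=schema) as conn:
        cursor = conn.cursor()
        cursor.execute(sql)           # the long analytics query
        data = cursor.fetchall()
        query.status = "success"      # metadata db write on the Query model
        db.session.commit()           # needs a live metadata db connection
    return data
```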

@mistercrunch (Member Author) commented Dec 6, 2024

Ok, so I removed the nasty SIMULATE_SLOW_ANALYTICS_DATABASE feature flag, and I'm thinking ideally we need to put in a significant refactor around what I wrote before ->

One question is whether we should scrap this PR and chain it behind a larger refactor, call it "untangle and centralize all analytics database interaction in the superset/db_engine_spec/ package". The goal there would be to move all analytics database operations into one place. One issue is that database interactions are somewhat tangled with Superset-specific logic: things like interaction with the cache manager, updating progress/status in the Query table, and so on. I think the way we'd handle it is by passing objects like the cache_manager into db_engine_spec, and even callbacks if needed.

@mistercrunch (Member Author):
On our side we're going to deploy this to a staging environment and run extensive tests around it. Not sure if it's mergeable as is, or whether we want to keep it out of master until the larger refactor I mentioned.

Given that it's a no-op, that I commit to doing the larger refactor, and that it could be useful to more people in the community, I'd advocate trying to get this merged. But I'm happy to just cherry-pick it as is on our side if we think it should be chained behind the refactor.

@mistercrunch mistercrunch force-pushed the sqla-close branch 2 times, most recently from 380acdf to 43d37d2 on December 9, 2024 19:41
@mistercrunch mistercrunch marked this pull request as ready for review December 9, 2024 19:44
@sadpandajoe sadpandajoe requested a review from eschutho December 9, 2024 20:02
Labels: change:backend, data:connect:postgres, preset-io, size/M

4 participants