-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #4349 #4350
base: master
Are you sure you want to change the base?
Fixes #4349 #4350
Conversation
This commit fixes an issue that allowed to trigger a runtime error by passing an array of arrays to `.eq_any()`. We rewrite queries containing `IN` expressions to `= ANY()` on postgresql as that more efficient (allows caching + allows binding all values at once). That's not possible for arrays of arrays as we do not support nested arrays yet. The fix introduces an associated constant for the `SqlType` trait that tracks if a SQL type is a `Array<T>` or not. This information is then used to conditionally generate the "right" SQL. By defaulting to `false` while adding this constant we do not break existing code. We use the derive to set the right value based on the type name.
Thanks a lot for working on this! ⚡️ However regarding the approach of
which seems more appropriate when there are many values. So overall it looks like we want the one that behaves the closest to what I'll investigate further on what query plans get generated for all 3 cases tomorrow. 🙂
This seems edge-case enough (considering that unless I missed an existing issue nobody else has reported it since Diesel 2.0) that I think I would tend to not consider it a release blocker 🙂 |
So I have investigated query plans a bit more and it looks like doing |
TL;DR: argument for using So I found this comment: Which suggests that
So it seems that But my experience was contrary to this (it was in fact, dumb, and I would see n² CPU consumption profiles). So I've performed new tests (see code below), and I can find:
Maybe I should send a message to the PG mailing list pointing that as a bug because the two recheck-on-array scenarios are very inconsistent with regards to their algorithmic complexity... So I don't know what to think... If But if ANY is known-sometimes-dumb and from the PG perspective it's user's responsibility to use There is also a case for doing Tests: BEGIN;
CREATE TABLE tmp_benchmark(
id Int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
value text NOT NULL
);
-- Insert 1M values
INSERT INTO tmp_benchmark(value) SELECT DISTINCT (floor(random() * 1000000000000)::int8)::text FROM generate_series(1,1000000);
ALTER TABLE tmp_benchmark ADD CONSTRAINT tmp_benchmark_unique_value_and_corresponding_index UNIQUE(value);
CREATE INDEX tmp_benchmark_benchmark_value_hash ON tmp_benchmark USING HASH (value);
ANALYZE tmp_benchmark;
-- Randomly sample 50k of those values
CREATE TEMPORARY TABLE lookup_test ON COMMIT DROP AS
WITH sub AS (SELECT value FROM tmp_benchmark LIMIT 10000000)
SELECT value FROM sub ORDER BY RANDOM() LIMIT 50000;
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (SELECT value FROM lookup_test);
/* Gives:
Hash Semi Join (cost=1445.00..20996.25 rows=50000 width=20) (actual time=4.595..107.748 rows=50000 loops=1)
Hash Cond: (tmp_benchmark.value = lookup_test.value)
-> Seq Scan on tmp_benchmark (cost=0.00..16370.00 rows=1000000 width=20) (actual time=0.008..39.863 rows=1000000 loops=1)
-> Hash (cost=820.00..820.00 rows=50000 width=12) (actual time=4.566..4.567 rows=50000 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 2704kB
-> Seq Scan on lookup_test (cost=0.00..820.00 rows=50000 width=12) (actual time=0.006..1.836 rows=50000 loops=1)
Planning Time: 0.247 ms
Execution Time: 108.723 ms
*/
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark (cost=985.09..1024.13 rows=10 width=20) (actual time=26.626..14440.230 rows=50000 loops=1)
Recheck Cond: (value = ANY ($0))
Rows Removed by Index Recheck: 7
Heap Blocks: exact=6369
InitPlan 1 (returns $0)
-> Aggregate (cost=945.00..945.01 rows=1 width=32) (actual time=4.606..4.607 rows=1 loops=1)
-> Seq Scan on lookup_test (cost=0.00..820.00 rows=50000 width=12) (actual time=0.007..1.998 rows=50000 loops=1)
-> Bitmap Index Scan on tmp_benchmark_benchmark_value_hash (cost=0.00..40.08 rows=10 width=0) (actual time=25.848..25.848 rows=50007 loops=1)
Index Cond: (value = ANY ($0))
Planning Time: 0.098 ms
Execution Time: 14441.520 ms
*/
-- Randomly sample 100k of those values
CREATE TEMPORARY TABLE lookup_test_2 ON COMMIT DROP AS
WITH sub AS (SELECT value FROM tmp_benchmark LIMIT 10000000)
SELECT value FROM sub ORDER BY RANDOM() LIMIT 100000;
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (SELECT value FROM lookup_test_2);
/* Gives:
Nested Loop (cost=1555.20..3025.00 rows=78336 width=20) (actual time=15.002..101.907 rows=100000 loops=1)
-> HashAggregate (cost=1555.20..1557.20 rows=200 width=32) (actual time=14.995..26.162 rows=100000 loops=1)
Group Key: lookup_test_2.value
Batches: 5 Memory Usage: 6977kB Disk Usage: 232kB
-> Seq Scan on lookup_test_2 (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.006..4.192 rows=100000 loops=1)
-> Index Scan using tmp_benchmark_benchmark_value_hash on tmp_benchmark (cost=0.00..7.88 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=100000)
Index Cond: (value = lookup_test_2.value)
Rows Removed by Index Recheck: 0
Planning Time: 0.148 ms
Execution Time: 103.924 ms
*/
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark (cost=1595.29..1634.33 rows=10 width=20) (actual time=44.862..58360.269 rows=100000 loops=1)
Recheck Cond: (value = ANY ($0))
Rows Removed by Index Recheck: 16
Heap Blocks: exact=6370
InitPlan 1 (returns $0)
-> Aggregate (cost=1555.20..1555.21 rows=1 width=32) (actual time=9.109..9.110 rows=1 loops=1)
-> Seq Scan on lookup_test_2 (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.007..4.002 rows=100000 loops=1)
-> Bitmap Index Scan on tmp_benchmark_benchmark_value_hash (cost=0.00..40.08 rows=10 width=0) (actual time=44.242..44.243 rows=100018 loops=1)
Index Cond: (value = ANY ($0))
Planning Time: 0.089 ms
Execution Time: 58364.590 ms
*/
-- Note how with 2x the number of values, execution time is x4 the above. Monitoring CPU while this query runs shows 100% CPU, unlike the subselect-based variant.
DROP INDEX tmp_benchmark_benchmark_value_hash;
-- Once we remove the hash index, PG is not stupid anymore, even with the same 100k sample and bitmap heap scan plan it doesn't do n² somehow
-- On the same 100k sample:
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark (cost=1599.54..1638.58 rows=10 width=20) (actual time=236.604..252.188 rows=100000 loops=1)
Recheck Cond: (value = ANY ($0))
Heap Blocks: exact=6370
InitPlan 1 (returns $0)
-> Aggregate (cost=1555.20..1555.21 rows=1 width=32) (actual time=9.091..9.093 rows=1 loops=1)
-> Seq Scan on lookup_test_2 (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.009..3.946 rows=100000 loops=1)
-> Bitmap Index Scan on tmp_benchmark_unique_value_and_corresponding_index (cost=0.00..44.33 rows=10 width=0) (actual time=236.169..236.169 rows=100000 loops=1)
Index Cond: (value = ANY ($0))
Planning Time: 0.161 ms
Execution Time: 254.145 ms
*/
-- So despite doing a bitmap heap scan on the same 100k elements array as above, this one seemingly doesn't n².
ROLLBACK; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized I also have examples in mind where = ANY(array)
performs better than = ANY(subquery)
because it manages to push a filter further into the query plan in that case, so I think that ends up convincing me that we just want to map to SQL as close as possible, so that we should do IN([1],[2],...)
.
The only thing is we should probably document it, considering that sometimes = ANY(...)
manages to not do n², and I suspect IN([1],[2],...)
may not be able to be as efficient in similar scenarios since it seems to expand to just many OR
clauses...
let connection: &mut PgConnection = &mut connection(); | ||
let tag_combinations_to_look_for: &[&[&str]] = &[&["foo"], &["foo", "bar"], &["baz"]]; | ||
let result: Vec<i32> = posts | ||
.filter(tags.eq_any(tag_combinations_to_look_for)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably want to document this behavior in these places:
diesel/diesel/src/expression/array_comparison.rs
Lines 31 to 32 in 5a2940e
/// The postgres backend provided a specialized implementation | |
/// by using `left = ANY(values)` as optimized variant instead. |
diesel/diesel/src/expression/array_comparison.rs
Lines 51 to 52 in 5a2940e
/// The postgres backend provided a specialized implementation | |
/// by using `left = ALL(values)` as optimized variant instead. |
diesel/diesel/src/expression_methods/global_expression_methods.rs
Lines 109 to 110 in 5a2940e
/// On PostgreSQL, this method automatically performs a `= ANY()` | |
/// query. |
if ST::IS_ARRAY { | ||
self.walk_ansi_ast(out) | ||
} else { | ||
out.push_bind_param::<Array<ST>, Vec<I>>(&self.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this got me thinking (long-term changes thoughts): maybe QueryId
should be generic on the backend as well, so that we can properly express that this variant could in fact use static query id. If we added the option for HAS_STATIC_QUERY_ID = false
to the QueryId
derive maybe it wouldn't be that verbose in general.
But maybe that's not worth. In any case that would be a separate change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like an reasonable idea for the long run.
Although I remember that the benchmarks have shown at some point that the cost of query construction for looking up statements in the prepared statement cache is only relevant for SQLite. It's not relevant for postgresql due to the network roundtrips.
For completeness: while the "reproducer" above is indeed n² with expressions like
|
This commit fixes an issue that allowed to trigger a runtime error by
passing an array of arrays to
.eq_any()
. We rewrite queries containingIN
expressions to= ANY()
on postgresql as that moreefficient (allows caching + allows binding all values at once). That's
not possible for arrays of arrays as we do not support nested arrays
yet.
The fix introduces an associated constant for the
SqlType
trait thattracks if a SQL type is a
Array<T>
or not. This information is thenused to conditionally generate the "right" SQL. By defaulting to
false
while adding this constant we do not break existing code. We use the
derive to set the right value based on the type name.
cc @Ten0
(I would suggest to include that PR in the patch release as well)